SlideShare uma empresa Scribd logo
1 de 21
Baixar para ler offline
0/10
International Workshop on Mining Software Repositories, Edinburgh, 25.05.2004




Preprocessing CVS Data
for Fine-Grained Analysis

Thomas Zimmermann 1 and Peter Weißgerber 2
1
    Saarland University
2
                                 ¨
    Catholic University of Eichstatt-Ingolstadt
Motivation                                                          1/10


Tom Ball et al. “If your version control system could talk. . . ”
So, why is my CVS so silent?

1. CVS has limited query functionality and is slow.
   ⇒ Copy CVS into a database
2. CVS splits up changes on multiple files.
   ⇒ Infer transactions
3. CVS knows only files—but what about functions?
   ⇒ Detect fine-grained changes
4. CVS contains unreliable data which is noise.
   ⇒ Clean data

Preprocessing is the key to a talkative version control system.
Copy CVS into a Database                                                                   2/10


                    !
           quot;


               #quot;
                         $%&!         '
                         $%&'             '

                         $%()     *
                        +,-$ $*     *.)
                            $+,-$ $*    *
                         $%(      *

               #/    quot;
                                )01                )0

           2222222222222222222222222222
                       !
                 )..0 .   % * 0( 0)1                     1        34 1         5   2
           6          #          )..0
           2222222222222222222222222222
                       '
                 )..% ) * ' )* %!1                       1        34 1         5 * 2)'
           0'.0.
           2222222222222222222222222222
                       *
                 )..% .* )' ' % )01                  1         34 1         5* 2
           quot;             * )1
           777     #             777
           2222222222222222222222222222

           2222222222222222222222222222
                       *)
                 )..0 .    ) & *%   1                  1         34 1         5 * 2)'
           8     /    93:,
           ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;



Create incremental copies with cvs rdiff -s or cvs status.
Infer Transactions: Time Windows                                    3/10


All changes by the same developer, with the same message,
made at the “same time” belong to one transaction.

                           ∀δi : ∀δj : |time(δi) − time(δj )| ≤ T
Fixed Time Window
Infer Transactions: Time Windows                                    3/10


All changes by the same developer, with the same message,
made at the “same time” belong to one transaction.

                           ∀δi : ∀δj : |time(δi) − time(δj )| ≤ T
Fixed Time Window




                           ∀δi : ∃δj : |time(δi) − time(δj )| ≤ T
Sliding Time Window
Infer Transactions: Time Windows                                    3/10


All changes by the same developer, with the same message,
made at the “same time” belong to one transaction.

                           ∀δi : ∀δj : |time(δi) − time(δj )| ≤ T
Fixed Time Window




                           ∀δi : ∃δj : |time(δi) − time(δj )| ≤ T
Sliding Time Window
Infer Transactions: Time Windows                                    3/10


All changes by the same developer, with the same message,
made at the “same time” belong to one transaction.

                           ∀δi : ∀δj : |time(δi) − time(δj )| ≤ T
Fixed Time Window




                           ∀δi : ∃δj : |time(δi) − time(δj )| ≤ T
Sliding Time Window
Infer Transactions: Time Windows                                    3/10


All changes by the same developer, with the same message,
made at the “same time” belong to one transaction.

                           ∀δi : ∀δj : |time(δi) − time(δj )| ≤ T
Fixed Time Window




                           ∀δi : ∃δj : |time(δi) − time(δj )| ≤ T
Sliding Time Window
Infer Transactions: Time Windows                                       3/10


All changes by the same developer, with the same message,
made at the “same time” belong to one transaction.

                              ∀δi : ∀δj : |time(δi) − time(δj )| ≤ T
Fixed Time Window




                              ∀δi : ∃δj : |time(δi) − time(δj )| ≤ T
Sliding Time Window




All changed files within one transaction have to be different.
Infer Transactions: Commit Mails                                      4/10


All changes listed in a commit mail belong to one transaction.
CVSROOT: /cvs/gcc
Module name: gcc
Changes by: zack@gcc.gnu.org 2004-05-01 19:12:47

Modified files:
gcc/cp          : ChangeLog decl.c

Log message:
* decl.c (reshape_init): Do not apply TYPE_DOMAIN to a VECTOR_TYPE.
Instead, dig into the representation type to find the array bound.

Patches:
http://.../cvsweb.cgi/gcc/gcc/cp/ChangeLog.diff?...&r2=1.4042
http://.../cvsweb.cgi/gcc/gcc/cp/decl.c.diff?...&r2=1.1204

Commit mails for GCC: http://gcc.gnu.org/ml/gcc-cvs/
Not every project provides useful commit mails.
Infer Transactions: Evaluation                                    5/10


We inferred transactions for 3 years GCC using commit mails.

Maximal Duration of a Commit
  21:17 minutes for “merged with ra-merge-initial” (5,910 files)
  ⇒ Sliding time windows are superior to fixed ones.
Infer Transactions: Evaluation                                    5/10


We inferred transactions for 3 years GCC using commit mails.

Maximal Duration of a Commit
  21:17 minutes for “merged with ra-merge-initial” (5,910 files)
  ⇒ Sliding time windows are superior to fixed ones.
Maximal Distance between two subsequent Checkins
  Depends on file size, RCS file size, and # of revisions.
  For almost all files below 3:00 minutes. Two exceptions:
  gcc/libstdc++-v3/configure, gcc/gcc/ChangeLog
  ⇒ Time windows should be at least 3:00 minutes.
Infer Transactions: Evaluation                                    5/10


We inferred transactions for 3 years GCC using commit mails.

Maximal Duration of a Commit
  21:17 minutes for “merged with ra-merge-initial” (5,910 files)
  ⇒ Sliding time windows are superior to fixed ones.
Maximal Distance between two subsequent Checkins
  Depends on file size, RCS file size, and # of revisions.
  For almost all files below 3:00 minutes. Two exceptions:
  gcc/libstdc++-v3/configure, gcc/gcc/ChangeLog
  ⇒ Time windows should be at least 3:00 minutes.
Minimal Distance between two similar Commits
   Bad news: 0:02 minutes for “Mark ChangeLog”
   Good news: All similar commits were really related.
   ⇒ Time windows have no upper bound (no duplicate files!)
Detect Fine-Grained Changes                                       6/10


What building blocks (e.g., functions, classes, sections, etc.)
have been changed between two revisions?
Detect Fine-Grained Changes                                       6/10


What building blocks (e.g., functions, classes, sections, etc.)
have been changed between two revisions?
Detect Fine-Grained Changes                                       6/10


What building blocks (e.g., functions, classes, sections, etc.)
have been changed between two revisions?
Noise: Large Transactions                              7/10


Large transactions are usually outliers:

 • “Change #include filenames from <foo.h> [sigh] to
   <openssl.h>.” (552 files, OPENSSL)
 • “Change functions to ANSI C.” (491 files, OPENSSL)

Solution: Ignore all transactions with size above N.
Noise: Merge Transactions   8/10
Noise: Merge Transactions                            8/10




Merges are noise for two reasons:

1. Merges contain unrelated changes — e.g. B and C
2. Merges duplicate related changes — e.g. A and B
Noise: Merge Transactions                              9/10




Two Solutions:

 • The Fischer/Pinzger/Gall heuristic (ICSM 2003).
 • Suspect & Verify approach based on log messages.
  Problem:
  “New isMerge(), isMergeWithConflicts(), and . . . ”
Lessons Learned                                                10/10


5 Databases simplify the exploration of CVS.
5 Sliding time windows are superior to fixed ones.
5 Length of time windows should be within 3 and 5 minutes.
5 Fine-grained analyses are feasible and worth while.
5 Take a look at the ECLIPSE framework for comparing files:
  org.eclipse.compare.structuremergeviewer
5 Merges are dirty transactions and difficult to recognize.


 Preprocessing is the key to any good and reliable analysis.

Mais conteúdo relacionado

Semelhante a Preprocessing CVS Data for Fine-Grained Analysis

5 Steps to Mastering the Art of Seaside
5 Steps to Mastering the Art of Seaside5 Steps to Mastering the Art of Seaside
5 Steps to Mastering the Art of SeasideLukas Renggli
 
Identifying Hotspots in Software Build Processes
Identifying Hotspots in Software Build ProcessesIdentifying Hotspots in Software Build Processes
Identifying Hotspots in Software Build ProcessesShane McIntosh
 
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data ProcessingCloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data ProcessingDoiT International
 
The Ring programming language version 1.8 book - Part 14 of 202
The Ring programming language version 1.8 book - Part 14 of 202The Ring programming language version 1.8 book - Part 14 of 202
The Ring programming language version 1.8 book - Part 14 of 202Mahmoud Samir Fayed
 
Switch case and looping new
Switch case and looping newSwitch case and looping new
Switch case and looping newaprilyyy
 
My programming final proj. (1)
My programming final proj. (1)My programming final proj. (1)
My programming final proj. (1)aeden_brines
 
Blockchain Development
Blockchain DevelopmentBlockchain Development
Blockchain Developmentpreetikumara
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with CassandraJacek Lewandowski
 
PostgreSQL as seen by Rubyists (Kaigi on Rails 2022)
PostgreSQL as seen by Rubyists (Kaigi on Rails 2022)PostgreSQL as seen by Rubyists (Kaigi on Rails 2022)
PostgreSQL as seen by Rubyists (Kaigi on Rails 2022)Андрей Новиков
 
(Even more) Rapid App Development with RubyMotion
(Even more) Rapid App Development with RubyMotion(Even more) Rapid App Development with RubyMotion
(Even more) Rapid App Development with RubyMotionStefan Haflidason
 
Enhancing Mobile User Experience with WebSocket
Enhancing Mobile User Experience with WebSocketEnhancing Mobile User Experience with WebSocket
Enhancing Mobile User Experience with WebSocketMauricio "Maltron" Leal
 
Performance measurement and tuning
Performance measurement and tuningPerformance measurement and tuning
Performance measurement and tuningAOE
 
Top Java Performance Problems and Metrics To Check in Your Pipeline
Top Java Performance Problems and Metrics To Check in Your PipelineTop Java Performance Problems and Metrics To Check in Your Pipeline
Top Java Performance Problems and Metrics To Check in Your PipelineAndreas Grabner
 
Practical JavaScript Programming - Session 8/8
Practical JavaScript Programming - Session 8/8Practical JavaScript Programming - Session 8/8
Practical JavaScript Programming - Session 8/8Wilson Su
 
Stress Testing at Twitter: a tale of New Year Eves
Stress Testing at Twitter: a tale of New Year EvesStress Testing at Twitter: a tale of New Year Eves
Stress Testing at Twitter: a tale of New Year EvesHerval Freire
 
Macasu, gerrell c.
Macasu, gerrell c.Macasu, gerrell c.
Macasu, gerrell c.gerrell
 
Vert.x – The problem of real-time data binding
Vert.x – The problem of real-time data bindingVert.x – The problem of real-time data binding
Vert.x – The problem of real-time data bindingAlex Derkach
 

Semelhante a Preprocessing CVS Data for Fine-Grained Analysis (20)

Switch case and looping jam
Switch case and looping jamSwitch case and looping jam
Switch case and looping jam
 
5 Steps to Mastering the Art of Seaside
5 Steps to Mastering the Art of Seaside5 Steps to Mastering the Art of Seaside
5 Steps to Mastering the Art of Seaside
 
Le langage rust
Le langage rustLe langage rust
Le langage rust
 
Identifying Hotspots in Software Build Processes
Identifying Hotspots in Software Build ProcessesIdentifying Hotspots in Software Build Processes
Identifying Hotspots in Software Build Processes
 
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data ProcessingCloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing
 
The Ring programming language version 1.8 book - Part 14 of 202
The Ring programming language version 1.8 book - Part 14 of 202The Ring programming language version 1.8 book - Part 14 of 202
The Ring programming language version 1.8 book - Part 14 of 202
 
Switch case and looping new
Switch case and looping newSwitch case and looping new
Switch case and looping new
 
My programming final proj. (1)
My programming final proj. (1)My programming final proj. (1)
My programming final proj. (1)
 
Blockchain Development
Blockchain DevelopmentBlockchain Development
Blockchain Development
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
 
PostgreSQL as seen by Rubyists (Kaigi on Rails 2022)
PostgreSQL as seen by Rubyists (Kaigi on Rails 2022)PostgreSQL as seen by Rubyists (Kaigi on Rails 2022)
PostgreSQL as seen by Rubyists (Kaigi on Rails 2022)
 
(Even more) Rapid App Development with RubyMotion
(Even more) Rapid App Development with RubyMotion(Even more) Rapid App Development with RubyMotion
(Even more) Rapid App Development with RubyMotion
 
Enhancing Mobile User Experience with WebSocket
Enhancing Mobile User Experience with WebSocketEnhancing Mobile User Experience with WebSocket
Enhancing Mobile User Experience with WebSocket
 
Performance measurement and tuning
Performance measurement and tuningPerformance measurement and tuning
Performance measurement and tuning
 
Time for Comet?
Time for Comet?Time for Comet?
Time for Comet?
 
Top Java Performance Problems and Metrics To Check in Your Pipeline
Top Java Performance Problems and Metrics To Check in Your PipelineTop Java Performance Problems and Metrics To Check in Your Pipeline
Top Java Performance Problems and Metrics To Check in Your Pipeline
 
Practical JavaScript Programming - Session 8/8
Practical JavaScript Programming - Session 8/8Practical JavaScript Programming - Session 8/8
Practical JavaScript Programming - Session 8/8
 
Stress Testing at Twitter: a tale of New Year Eves
Stress Testing at Twitter: a tale of New Year EvesStress Testing at Twitter: a tale of New Year Eves
Stress Testing at Twitter: a tale of New Year Eves
 
Macasu, gerrell c.
Macasu, gerrell c.Macasu, gerrell c.
Macasu, gerrell c.
 
Vert.x – The problem of real-time data binding
Vert.x – The problem of real-time data bindingVert.x – The problem of real-time data binding
Vert.x – The problem of real-time data binding
 

Mais de Thomas Zimmermann

Software Analytics = Sharing Information
Software Analytics = Sharing InformationSoftware Analytics = Sharing Information
Software Analytics = Sharing InformationThomas Zimmermann
 
Predicting Method Crashes with Bytecode Operations
Predicting Method Crashes with Bytecode OperationsPredicting Method Crashes with Bytecode Operations
Predicting Method Crashes with Bytecode OperationsThomas Zimmermann
 
Analytics for smarter software development
Analytics for smarter software development Analytics for smarter software development
Analytics for smarter software development Thomas Zimmermann
 
Characterizing and Predicting Which Bugs Get Reopened
Characterizing and Predicting Which Bugs Get ReopenedCharacterizing and Predicting Which Bugs Get Reopened
Characterizing and Predicting Which Bugs Get ReopenedThomas Zimmermann
 
Data driven games user research
Data driven games user researchData driven games user research
Data driven games user researchThomas Zimmermann
 
Not my bug! Reasons for software bug report reassignments
Not my bug! Reasons for software bug report reassignmentsNot my bug! Reasons for software bug report reassignments
Not my bug! Reasons for software bug report reassignmentsThomas Zimmermann
 
Empirical Software Engineering at Microsoft Research
Empirical Software Engineering at Microsoft ResearchEmpirical Software Engineering at Microsoft Research
Empirical Software Engineering at Microsoft ResearchThomas Zimmermann
 
Security trend analysis with CVE topic models
Security trend analysis with CVE topic modelsSecurity trend analysis with CVE topic models
Security trend analysis with CVE topic modelsThomas Zimmermann
 
Analytics for software development
Analytics for software developmentAnalytics for software development
Analytics for software developmentThomas Zimmermann
 
Characterizing and predicting which bugs get fixed
Characterizing and predicting which bugs get fixedCharacterizing and predicting which bugs get fixed
Characterizing and predicting which bugs get fixedThomas Zimmermann
 
Changes and Bugs: Mining and Predicting Development Activities
Changes and Bugs: Mining and Predicting Development ActivitiesChanges and Bugs: Mining and Predicting Development Activities
Changes and Bugs: Mining and Predicting Development ActivitiesThomas Zimmermann
 
Cross-project defect prediction
Cross-project defect predictionCross-project defect prediction
Cross-project defect predictionThomas Zimmermann
 
Changes and Bugs: Mining and Predicting Development Activities
Changes and Bugs: Mining and Predicting Development ActivitiesChanges and Bugs: Mining and Predicting Development Activities
Changes and Bugs: Mining and Predicting Development ActivitiesThomas Zimmermann
 
Predicting Defects using Network Analysis on Dependency Graphs
Predicting Defects using Network Analysis on Dependency GraphsPredicting Defects using Network Analysis on Dependency Graphs
Predicting Defects using Network Analysis on Dependency GraphsThomas Zimmermann
 
Quality of Bug Reports in Open Source
Quality of Bug Reports in Open SourceQuality of Bug Reports in Open Source
Quality of Bug Reports in Open SourceThomas Zimmermann
 
Predicting Subsystem Defects using Dependency Graph Complexities
Predicting Subsystem Defects using Dependency Graph Complexities Predicting Subsystem Defects using Dependency Graph Complexities
Predicting Subsystem Defects using Dependency Graph Complexities Thomas Zimmermann
 
Got Myth? Myths in Software Engineering
Got Myth? Myths in Software EngineeringGot Myth? Myths in Software Engineering
Got Myth? Myths in Software EngineeringThomas Zimmermann
 

Mais de Thomas Zimmermann (20)

Software Analytics = Sharing Information
Software Analytics = Sharing InformationSoftware Analytics = Sharing Information
Software Analytics = Sharing Information
 
MSR 2013 Preview
MSR 2013 PreviewMSR 2013 Preview
MSR 2013 Preview
 
Predicting Method Crashes with Bytecode Operations
Predicting Method Crashes with Bytecode OperationsPredicting Method Crashes with Bytecode Operations
Predicting Method Crashes with Bytecode Operations
 
Analytics for smarter software development
Analytics for smarter software development Analytics for smarter software development
Analytics for smarter software development
 
Characterizing and Predicting Which Bugs Get Reopened
Characterizing and Predicting Which Bugs Get ReopenedCharacterizing and Predicting Which Bugs Get Reopened
Characterizing and Predicting Which Bugs Get Reopened
 
Klingon Countdown Timer
Klingon Countdown TimerKlingon Countdown Timer
Klingon Countdown Timer
 
Data driven games user research
Data driven games user researchData driven games user research
Data driven games user research
 
Not my bug! Reasons for software bug report reassignments
Not my bug! Reasons for software bug report reassignmentsNot my bug! Reasons for software bug report reassignments
Not my bug! Reasons for software bug report reassignments
 
Empirical Software Engineering at Microsoft Research
Empirical Software Engineering at Microsoft ResearchEmpirical Software Engineering at Microsoft Research
Empirical Software Engineering at Microsoft Research
 
Security trend analysis with CVE topic models
Security trend analysis with CVE topic modelsSecurity trend analysis with CVE topic models
Security trend analysis with CVE topic models
 
Analytics for software development
Analytics for software developmentAnalytics for software development
Analytics for software development
 
Characterizing and predicting which bugs get fixed
Characterizing and predicting which bugs get fixedCharacterizing and predicting which bugs get fixed
Characterizing and predicting which bugs get fixed
 
Changes and Bugs: Mining and Predicting Development Activities
Changes and Bugs: Mining and Predicting Development ActivitiesChanges and Bugs: Mining and Predicting Development Activities
Changes and Bugs: Mining and Predicting Development Activities
 
Cross-project defect prediction
Cross-project defect predictionCross-project defect prediction
Cross-project defect prediction
 
Changes and Bugs: Mining and Predicting Development Activities
Changes and Bugs: Mining and Predicting Development ActivitiesChanges and Bugs: Mining and Predicting Development Activities
Changes and Bugs: Mining and Predicting Development Activities
 
Predicting Defects using Network Analysis on Dependency Graphs
Predicting Defects using Network Analysis on Dependency GraphsPredicting Defects using Network Analysis on Dependency Graphs
Predicting Defects using Network Analysis on Dependency Graphs
 
Quality of Bug Reports in Open Source
Quality of Bug Reports in Open SourceQuality of Bug Reports in Open Source
Quality of Bug Reports in Open Source
 
Meet Tom and his Fish
Meet Tom and his FishMeet Tom and his Fish
Meet Tom and his Fish
 
Predicting Subsystem Defects using Dependency Graph Complexities
Predicting Subsystem Defects using Dependency Graph Complexities Predicting Subsystem Defects using Dependency Graph Complexities
Predicting Subsystem Defects using Dependency Graph Complexities
 
Got Myth? Myths in Software Engineering
Got Myth? Myths in Software EngineeringGot Myth? Myths in Software Engineering
Got Myth? Myths in Software Engineering
 

Último

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 

Último (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 

Preprocessing CVS Data for Fine-Grained Analysis

  • 1. 0/10 International Workshop on Mining Software Repositories, Edinburgh, 25.05.2004 Preprocessing CVS Data for Fine-Grained Analysis Thomas Zimmermann 1 and Peter Weißgerber 2 1 Saarland University 2 ¨ Catholic University of Eichstatt-Ingolstadt
  • 2. Motivation 1/10 Tom Ball et al. “If your version control system could talk. . . ” So, why is my CVS so silent? 1. CVS has limited query functionality and is slow. ⇒ Copy CVS into a database 2. CVS splits up changes on multiple files. ⇒ Infer transactions 3. CVS knows only files—but what about functions? ⇒ Detect fine-grained changes 4. CVS contains unreliable data which is noise. ⇒ Clean data Preprocessing is the key to a talkative version control system.
  • 3. Copy CVS into a Database 2/10 ! quot; #quot; $%&! ' $%&' ' $%() * +,-$ $* *.) $+,-$ $* * $%( * #/ quot; )01 )0 2222222222222222222222222222 ! )..0 . % * 0( 0)1 1 34 1 5 2 6 # )..0 2222222222222222222222222222 ' )..% ) * ' )* %!1 1 34 1 5 * 2)' 0'.0. 2222222222222222222222222222 * )..% .* )' ' % )01 1 34 1 5* 2 quot; * )1 777 # 777 2222222222222222222222222222 2222222222222222222222222222 *) )..0 . ) & *% 1 1 34 1 5 * 2)' 8 / 93:, ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; Create incremental copies with cvs rdiff -s or cvs status.
  • 4. Infer Transactions: Time Windows 3/10 All changes by the same developer, with the same message, made at the “same time” belong to one transaction. ∀δi : ∀δj : |time(δi) − time(δj )| ≤ T Fixed Time Window
  • 5. Infer Transactions: Time Windows 3/10 All changes by the same developer, with the same message, made at the “same time” belong to one transaction. ∀δi : ∀δj : |time(δi) − time(δj )| ≤ T Fixed Time Window ∀δi : ∃δj : |time(δi) − time(δj )| ≤ T Sliding Time Window
  • 6. Infer Transactions: Time Windows 3/10 All changes by the same developer, with the same message, made at the “same time” belong to one transaction. ∀δi : ∀δj : |time(δi) − time(δj )| ≤ T Fixed Time Window ∀δi : ∃δj : |time(δi) − time(δj )| ≤ T Sliding Time Window
  • 7. Infer Transactions: Time Windows 3/10 All changes by the same developer, with the same message, made at the “same time” belong to one transaction. ∀δi : ∀δj : |time(δi) − time(δj )| ≤ T Fixed Time Window ∀δi : ∃δj : |time(δi) − time(δj )| ≤ T Sliding Time Window
  • 8. Infer Transactions: Time Windows 3/10 All changes by the same developer, with the same message, made at the “same time” belong to one transaction. ∀δi : ∀δj : |time(δi) − time(δj )| ≤ T Fixed Time Window ∀δi : ∃δj : |time(δi) − time(δj )| ≤ T Sliding Time Window
  • 9. Infer Transactions: Time Windows 3/10 All changes by the same developer, with the same message, made at the “same time” belong to one transaction. ∀δi : ∀δj : |time(δi) − time(δj )| ≤ T Fixed Time Window ∀δi : ∃δj : |time(δi) − time(δj )| ≤ T Sliding Time Window All changed files within one transaction have to be different.
  • 10. Infer Transactions: Commit Mails 4/10 All changes listed in a commit mail belong to one transaction. CVSROOT: /cvs/gcc Module name: gcc Changes by: zack@gcc.gnu.org 2004-05-01 19:12:47 Modified files: gcc/cp : ChangeLog decl.c Log message: * decl.c (reshape_init): Do not apply TYPE_DOMAIN to a VECTOR_TYPE. Instead, dig into the representation type to find the array bound. Patches: http://.../cvsweb.cgi/gcc/gcc/cp/ChangeLog.diff?...&r2=1.4042 http://.../cvsweb.cgi/gcc/gcc/cp/decl.c.diff?...&r2=1.1204 Commit mails for GCC: http://gcc.gnu.org/ml/gcc-cvs/ Not every project provides useful commit mails.
  • 11. Infer Transactions: Evaluation 5/10 We inferred transactions for 3 years GCC using commit mails. Maximal Duration of a Commit 21:17 minutes for “merged with ra-merge-initial” (5,910 files) ⇒ Sliding time windows are superior to fixed ones.
  • 12. Infer Transactions: Evaluation 5/10 We inferred transactions for 3 years GCC using commit mails. Maximal Duration of a Commit 21:17 minutes for “merged with ra-merge-initial” (5,910 files) ⇒ Sliding time windows are superior to fixed ones. Maximal Distance between two subsequent Checkins Depends on file size, RCS file size, and # of revisions. For almost all files below 3:00 minutes. Two exceptions: gcc/libstdc++-v3/configure, gcc/gcc/ChangeLog ⇒ Time windows should be at least 3:00 minutes.
  • 13. Infer Transactions: Evaluation 5/10 We inferred transactions for 3 years GCC using commit mails. Maximal Duration of a Commit 21:17 minutes for “merged with ra-merge-initial” (5,910 files) ⇒ Sliding time windows are superior to fixed ones. Maximal Distance between two subsequent Checkins Depends on file size, RCS file size, and # of revisions. For almost all files below 3:00 minutes. Two exceptions: gcc/libstdc++-v3/configure, gcc/gcc/ChangeLog ⇒ Time windows should be at least 3:00 minutes. Minimal Distance between two similar Commits Bad news: 0:02 minutes for “Mark ChangeLog” Good news: All similar commits were really related. ⇒ Time windows have no upper bound (no duplicate files!)
  • 14. Detect Fine-Grained Changes 6/10 What building blocks (e.g., functions, classes, sections, etc.) have been changed between two revisions?
  • 15. Detect Fine-Grained Changes 6/10 What building blocks (e.g., functions, classes, sections, etc.) have been changed between two revisions?
  • 16. Detect Fine-Grained Changes 6/10 What building blocks (e.g., functions, classes, sections, etc.) have been changed between two revisions?
  • 17. Noise: Large Transactions 7/10 Large transactions are usually outliers: • “Change #include filenames from <foo.h> [sigh] to <openssl.h>.” (552 files, OPENSSL) • “Change functions to ANSI C.” (491 files, OPENSSL) Solution: Ignore all transactions with size above N.
  • 19. Noise: Merge Transactions 8/10 Merges are noise for two reasons: 1. Merges contain unrelated changes — e.g. B and C 2. Merges duplicate related changes — e.g. A and B
  • 20. Noise: Merge Transactions 9/10 Two Solutions: • The Fischer/Pinzger/Gall heuristic (ICSM 2003). • Suspect & Verify approach based on log messages. Problem: “New isMerge(), isMergeWithConflicts(), and . . . ”
  • 21. Lessons Learned 10/10 5 Databases simplify the exploration of CVS. 5 Sliding time windows are superior to fixed ones. 5 Length of time windows should be within 3 and 5 minutes. 5 Fine-grained analyses are feasible and worth while. 5 Take a look at the ECLIPSE framework for comparing files: org.eclipse.compare.structuremergeviewer 5 Merges are dirty transactions and difficult to recognize. Preprocessing is the key to any good and reliable analysis.