Slides for the data paper "Mining the Modern Code Review Repositories: A Dataset of People, Process and Product" in the proceedings of the 13th International Conference on Mining Software Repositories (MSR 2016), Austin, TX, May 2016.
Mining the Modern Code Review Repositories: A Dataset of People, Process and Product (MSR 2016)
1. Mining the Modern Code Review Repositories:
A Dataset of People, Process and Product
Xin Yang Raula G. Kula Norihiro Yoshida Hajimu Iida
May 14–15, 2016. Austin, Texas
MSR 2016 data showcase
Osaka University, Japan
Nagoya University, Japan
NAIST, Japan
NAIST, Japan
2. An Overview of the Code Review Dataset
● Code Review
● Source Code
● Human / Social
3. Why did we make this dataset?
Our JSON-based dataset (Hamasaki et al., MSR '13)*
*Hamasaki et al., “Who does what during a code review? Datasets of OSS peer review repositories”, MSR '13
4. Our previous work
(Hamasaki et al. MSR '13)*
Some feedback:
“Hard to query...”
“Hard to convert...”
“Unable to access the source code...”
5. Our previous work
(Hamasaki et al. MSR '13)*
Script
8. Dataset Statistics (updated to May 2015)
                OpenStack   LibreOffice   AOSP      Qt        Eclipse
Time            4 years     3 years       7 years   4 years   3 years
Repositories    611         20            567       111       189
Patches         173,749     13,597        63,610    110,172   9,168
Participants    5,091       437           3,334     1,437     759
Why did we make this dataset?
Code review dataset from 5 successful OSS projects
Source code from Git
Human and social information (anonymized usernames and email addresses)
Our previous work at MSR 2013 provided a dataset in JSON format and a refined dataset in CSV format.
Over the past 3 years we have received a lot of feedback from our dataset users.
Some users complained that: ...
Thus, we improved our dataset by converting the JSON into a MySQL database and by providing shell scripts to access the source code...
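To illustrate why a relational form is easier to query than raw JSON, here is a minimal sketch of the conversion idea. It is not the authors' conversion script: the sample records are made up, the field names (`_number`, `project`, `status`, `owner`) are assumptions loosely modelled on Gerrit change records, the `changes` table is hypothetical, and SQLite from the Python standard library stands in for MySQL so the example runs without a server.

```python
import json
import sqlite3

# Made-up sample records; field names are assumptions, not the dataset's schema.
SAMPLE_JSON = """
[
  {"_number": 101, "project": "demo/core", "status": "MERGED",
   "owner": {"_account_id": 7}},
  {"_number": 102, "project": "demo/core", "status": "ABANDONED",
   "owner": {"_account_id": 9}},
  {"_number": 103, "project": "demo/ui", "status": "MERGED",
   "owner": {"_account_id": 7}}
]
"""


def load_changes(conn, raw_json):
    """Flatten nested JSON change records into a single relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS changes ("
        "number INTEGER PRIMARY KEY, project TEXT, status TEXT, owner_id INTEGER)"
    )
    rows = [
        (c["_number"], c["project"], c["status"], c["owner"]["_account_id"])
        for c in json.loads(raw_json)
    ]
    conn.executemany("INSERT INTO changes VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def merged_per_project(conn):
    """Count merged changes per project with one SQL query."""
    cur = conn.execute(
        "SELECT project, COUNT(*) FROM changes "
        "WHERE status = 'MERGED' GROUP BY project ORDER BY project"
    )
    return cur.fetchall()


# SQLite in memory stands in for a MySQL server in this sketch.
conn = sqlite3.connect(":memory:")
loaded = load_changes(conn, SAMPLE_JSON)
result = merged_per_project(conn)
print(loaded, result)
```

Once the records are in a table, a question such as "how many changes were merged per project" becomes a single `GROUP BY` query instead of a hand-written loop over nested JSON.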
This is a typical MCR (Modern Code Review) process:
Authors create and update their patches (changes).
Reviewers perform code reviews on the changes and send feedback to the authors.
Continuous Integration (CI) tools build and test the changes.
After several rounds of revision, the changes pass review and are integrated into the code repositories.
Our dataset tries to capture data on three different aspects of the code review process.
First, how developers, reviewers and CI tools collaborate (see People).
Second, what the life cycle of a change is, from initial commit to final decision (see Process).
Finally, what the product of code review is (see Product).
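The three aspects can be pictured as fields of a single change record. The sketch below is an illustrative model only, not the dataset's actual schema: the `Change` class, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Change:
    """Hypothetical record for one reviewed change (not the dataset schema)."""
    author: str                                   # People
    created: datetime                             # Process: initial commit
    decided: datetime                             # Process: final decision
    status: str                                   # Product: e.g. "MERGED"
    reviewers: set = field(default_factory=set)   # People
    ci_tools: set = field(default_factory=set)    # People (automated)
    revisions: int = 1                            # Process: number of patch sets

    def participants(self):
        """Everyone involved: author, human reviewers and CI tools."""
        return {self.author} | self.reviewers | self.ci_tools

    def days_to_decision(self):
        """Length of the life cycle, from creation to final decision, in days."""
        return (self.decided - self.created).total_seconds() / 86400


# Made-up example values for illustration.
change = Change(
    author="author_1",
    created=datetime(2015, 5, 1, 9, 0),
    decided=datetime(2015, 5, 4, 9, 0),
    status="MERGED",
    reviewers={"reviewer_1", "reviewer_2"},
    ci_tools={"ci_bot"},
    revisions=3,
)
print(len(change.participants()), change.days_to_decision(), change.status)
```

Grouping the fields this way keeps the People, Process and Product questions separate while still letting one record answer all three.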
Some basic statistics about our dataset:
We retrieved data from 5 large-scale, successful OSS projects:
OpenStack, LibreOffice, AOSP, Qt and Eclipse
Time: how long the project has used Gerrit code review (since it adopted Gerrit)
Repositories: how many repositories are involved
Patches: how many changes have been created
Participants: how many people have participated in
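As a small worked example of how these figures can be compared across projects, the sketch below computes the average number of patches per repository. It assumes the columns of the statistics slide follow the project order given above (OpenStack, LibreOffice, AOSP, Qt, Eclipse); the helper name `patches_per_repository` is introduced here for illustration only.

```python
# Figures from the Dataset Statistics slide (updated to May 2015).
# Assumption: slide columns follow the project order listed in the notes.
STATS = {
    "OpenStack":   {"repositories": 611, "patches": 173_749, "participants": 5_091},
    "LibreOffice": {"repositories": 20,  "patches": 13_597,  "participants": 437},
    "AOSP":        {"repositories": 567, "patches": 63_610,  "participants": 3_334},
    "Qt":          {"repositories": 111, "patches": 110_172, "participants": 1_437},
    "Eclipse":     {"repositories": 189, "patches": 9_168,   "participants": 759},
}


def patches_per_repository(stats):
    """Average number of patches (changes) per repository for each project."""
    return {
        name: row["patches"] / row["repositories"]
        for name, row in stats.items()
    }


ratios = patches_per_repository(STATS)

# Print projects from the highest to the lowest patch density.
for name, value in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {value:8.1f}")

# Project with the most patches per repository under the stated assumption.
busiest = max(ratios, key=ratios.get)
```

Raw totals favour projects with many repositories, so a per-repository ratio gives a different view of where review activity is concentrated.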