The document describes an empirical study on specialization in open source software communities. The study analyzes activity and workload data from over 1,300 projects and 5,000 developers in the GNOME desktop environment. Key findings include that most developers and projects concentrate their work in a few activity types, particularly coding, documentation, and translation. The analysis uses metrics like the Gini index to measure specialization at the developer and project level across different activities like coding, documentation, translation, and more. Overall, the study finds evidence that open source work exhibits specialization, with developers and projects focusing more intensively on certain types of activities.
Oppenheimer Film Discussion for Philosophy and Film
An empirical study on the Specialisation Effect in Open Source Communities
1. An empirical study on the
Specialisation Effect in
Open Source Communities
Mathieu Goeminne & Tom Mens
University of Mons
Bogdan Vasilescu & Alexander Serebrenik
Eindhoven University of Technology
1
2. Our case study: Gnome
• A large ecosystem
• ~1,300 projects having a common
characteristic : integration in the same GNU/
Linux Desktop environment
• A large developer community
• > 5,000 involved authors in the Git repositories
• A set of activity types
• coding, documentation, translation, etc.
BENEVOL 2011 Empirical Study on Specialisation in FLOSS communities 08/12/11
2
3. Methodology
• Goal-Question-Metric approach:
• Define research goals
• Define questions to reach these goals
• Define metrics the answer these
questions
• Use metrics to statistically verify
hypotheses about the questions
BENEVOL 2011 Empirical Study on Specialisation in FLOSS communities 08/12/11
3
4. Research Goals:
understand
1. how development effort varies across
projects;
2. how development effort varies across
developers;
3. how the type of development activity
relates to the development effort.
BENEVOL 2011 Empirical Study on Specialisation in FLOSS communities 08/12/11
4
5. Research Questions
• Are the projects containing more activity
types more active? [Goal 1]
• How specialised are the projects towards
different activity types? [Goal 1]
• What influences the project specialisation?
To which extent are developers specialised
in different activity types? [Goal 2]
• etc.
BENEVOL 2011 Empirical Study on Specialisation in FLOSS communities 08/12/11
5
6. Workload and
Involvement Metrics
• workload APTW(p,d,t)
• # files in project p touched by developer d having touched a file of
activity type t
• involvement APTI(p,d,t)
• 1 if APTW(p,d,t) > 0; 0 otherwise
• Higher level metrics : aggregation over all projects, developers and/or
activity types. Ex:
• NPD(d) = Number of projects in which developer d is involved
• PW(p) = Global workload of project p
• RPTW(p,t) = Workload in p w.r.t activity type t, relative to the global
project workload
• PWS(p) = Specialisation (imbalance or unequal distribution) of
workload across project activity types (using Gini index).
BENEVOL 2011 Empirical Study on Specialisation in FLOSS communities 08/12/11
6
7. The Gini Index
Economic measure of inequality
• Aggregates a collection of values (e.g.
incomes in a country)
• A value between 0 and 1 :
• 0 if everybody has the same income
• 1 if a person has all the income and the
others don’t have anything.
BENEVOL 2011 Empirical Study on Specialisation in FLOSS communities 08/12/11
7
8. Experimental setup
• Extract data from Gnome’s Git project repositories
• Populate databases using this data
• Detect multiple logins belonging the same ‘real’
developer thanks to an identity matching algorithm.
• Define activity types carried out by project
developers
• Compute metrics
• Verify statistical hypotheses
BENEVOL 2011 Empirical Study on Specialisation in FLOSS communities 08/12/11
8
9. Activity types
• 13 activity types, including:
• coding, translation, documentation, building,
development documentation.
• File classification: based on the file path, name
and extension and a set of rules. A rule is a type
and a regular expression.
• Ex: (doc, ‘.*/DOC(-?)BOOK(S?)/.*’)
BENEVOL 2011 Empirical Study on Specialisation in FLOSS communities 08/12/11
9
10. How is the activity distributed
across activity types?
0.4
0.3
0.2
0.1
0.0
develdoc
build
doc
unknown
test
code
translate
mmedia
library
image
config
ui
meta
database
BENEVOL 2011 Empirical Study on Specialisation in FLOSS communities 08/12/11
10
11. Developer workload distribution
Imbalance by activity type
RAWS(p)
● ● ● ●●●●●
● ●● ●●●●
●●●●●
●●●●
●●●●
● ●
0.0 0.2 0.4 0.6 0.8 1.0
Most developers concentrate their
workload in few activities
BENEVOL 2011 Empirical Study on Specialisation in FLOSS communities 08/12/11
11
12. Project workload distribution
Imbalance by activity type
RPWS(p)
● ●●●
●●
0.0 0.2 0.4 0.6 0.8 1.0
Most projects concentrate their
workload in few activities
BENEVOL 2011 Empirical Study on Specialisation in FLOSS communities 08/12/11
12
17. Future work
• Take into account other factors
• time: How do our metrics and their relations evolve
over time?
• project categories (e.g. archived, deprecated, ...)
• project age and maturity
• main programming language
• ...
• Develop a dashboard tool to detect, prevent and predict
health problems in evolving software ecosystems
BENEVOL 2011 Empirical Study on Specialisation in FLOSS communities 08/12/11
17
18. Thank you
BENEVOL 2011 Empirical Study on Specialisation in FLOSS communities 08/12/11
18