The document discusses analyzing the social aspects of open source software ecosystems by studying software evolution over time. Specifically, it aims to analyze how social factors like community interactions, structures, and processes coevolve with the software product and process. The researchers plan to mine data from various open source project repositories, select appropriate projects, and use frameworks like Herdsman and FLOSSMetrics to extract and merge identity data, perform statistical analysis and visualization. This will help understand success factors for open source projects and identify best practices.
A Beginners Guide to Building a RAG App Using Open Source Milvus
Analysing the evolution of social aspects of open source software ecosystems
1. Analysing the evolution of
social aspects of open
source software ecosystems
Tom Mens & Mathieu Goeminne
Service de Génie Logiciel
Département des Sciences Informatiques
Faculté des Sciences - Université de Mons
Belgique
2. Research topic
• Study of software evolution
• in particular, the software process and software
quality
• Focus on software ecosystems (see next slide)
• Focus on social aspects (see next slide)
• By analysing and visualising historical data of open source
projects
• obtained by mining and combining data from
different types of repositories
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
3. Goal
• Analyse the success factors (e.g. popularity,
quality) of OSS projects?
• What are good and bad practices of OSS evolution?
What lessons can we learn?
• Study how social aspects co-evolve with, and
affect the software product and process
• Exploit this knowledge to improve the current
practice (e.g., guidelines, tool support)
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
4. Software ecosystem
• There are many views on a software ecosystem. Our focus:
• the ecosystem of a single project, containing all artefacts (code,
documentation, etc.) and persons involved in using, producing or
modifying these artefacts
• the ecosystem of a coherent collection or distribution of individual
projects
• Example: GNOME desktop environment for Linux, containing
hundreds of applications/project
• The ecosystem of each individual GNOME project can be studied
• The interaction between all GNOME projects and their
contributors can be studied
• Other examples: Linux OS distributions (e.g. Debian, Ubuntu)
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
5. Social aspects
• Which communities (e.g. users, developers) are involved in a
software project?
• relation between project quality and popularity?
• How do communities communicate / interact?
• e.g. how are developers driven by user requests and how does this influence
the project evolution?
• How are communities structured?
• relation between community structure and software quality or maintainability?
• How is work distributed among persons?
• relation between distribution of work / responsibility and maintainability?
• Which processes (formal or informal) are used?
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
6. Why open source?
• Free access to source code, defect data,
developer and user communication
• Historical data available in open repositories
• Observable communities
• Observable activities
• Increasing popularity for personal and
commercial use
• A huge range of community and software
sizes
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
7. Methodology
• Exploit available data from different repositories
• code repositories (e.g. SVN, Git, ...)
• mail repositories (mailing lists)
• bug repositories (bug trackers)
• Select open source projects
• Use Herdsman framework
• Based on FLOSSMetrics data extraction
• Use identity merging tool
• Use of statistical analysis and visualisation
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
8. Selecting Projects
• Criteria for selecting projects
• Availability of data from repositories
• Data processable by FLOSSMetrics tools
• CVSAnaly2, MLStats, Bicho
• Size of considered projects: persons involved,
code size, activity in each repository
• Age of considered projects
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
11. Comparing Merge Algorithms
• Based on a reference merge model
• manually created
• iterative approach
• relying on information contained in different
files (COMMITTERS, MAINTAINERS, AUTHORS, NEWS, README)
• Compute, for each algorithm, precision and
recall w.r.t. reference model
12. Comparing Merge
Algorithms
• Simple
• Bird (code and mail repositories only)
• based on Levenshtein distance
• Bird, extended for bug repositories
• Improved
• Combining ideas from Bird and Robles
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
16. Analysing
Activity distribution
Three different longitudinal studies
1. Distribution of work, for a single project, using coarse-grained data
from different repositories
How are developer activities (commits, mails, bug fixes)
distributed across repositories? How is activity distributed
across developers? How does this evolve over time?
2. Distribution of work, for a single project, using fine-grained data
from a single repository
Based on commits of different file types in version repository
3. Distribution of work across different projects belonging to the
same community (e.g. GNOME)
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
17. Activity distribution
First study
• Analyse, for individual Gnome projects,
historical data from version repositories,
bug trackers and mailing lists
• Focus on 3 types of activities per person:
commits, mails, bug report changes
• How are activities distributed: over
different persons, over time?
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
19. Activity distribution
First study
• Evidence of Pareto principle (20/80 rule)?
• Most activity is carried out by a small
group of persons
• Typically : 20% do 80% of the job
• Distribution of code activity more unequal
than mail activity
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
20. Activity distribution
First study
• Analyse activity distribution over time
• Use econometrics
• express inequality in a distribution
• aggregation metrics:
Gini, Hoover, Theil (normalised)
• Values between 0 and 1
• 0 = perfect equality; 1 = perfect inequality
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
21. Activity distribution
First study
1
Evolution of aggregation indices for Evince
0.9
0.8
0.7
0.6
0.5
0.4 Gini
Hoover
0.3
Theil (normalised)
0.2
0.1
0
Apr-99 Dec-99 Aug-00 Apr-01 Dec-01 Aug-02 Apr-03 Dec-03 Aug-04 Apr-05 Dec-05 Aug-06 Apr-07 Dec-07 Aug-08
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
22. Activity distribution
First study
$"
!#,"
!#+" Evolution of Gini index
!#*"
!#)"
!#("
!#'"
!#&"
.4556/7"
58697"
Brasero
!#%" :;"0".<8=>?7"
!#$"
!"
-./0!)" 1230!*" -./0!*" 1230!+" -./0!+" 1230!," -./0!," 1230$!" -./0$!"
$"
!#,"
!#+"
!#*"
!#)"
!#("
!#'" 1233456" Evince
!#&" 37486"
!#%" 9:;"/<.2/5"1=7>;<6"
!#$"
!"
-./0,," -./0!!" -./0!$" -./0!%" -./0!&" -./0!'" -./0!(" -./0!)" -./0!*" -./0!+" -./0!," -./0$!"
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
23. Activity distribution
First study
• Identifying core groups
• Display Venn diagrams of most active (top
20) persons, according to each activity
type (committing, mailing, bug report
changing)
• For each person, show percentage of
activity attributable to this person
• Take into account identity merges
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
24. Activity distribution
Identifying core groups
Brasero
Evince
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
25. Activity distribution
First study
• Conclusion
• Activity distributions seem to become
more and more unequally distributed
• The Pareto principle is clearly present in
studied projects
• For Brasero and Evince, the activity is led
by a few persons involved in 2 or 3 of the
defined activities
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
26. Activity distribution
First study
• Future work
• Identify the type of statistical distribution
• Use sliding window approach
• detect impact of personnel turnover: ignore
inactive persons, and discover new active
persons
• Automate detection of core groups
• Study the evolution of core groups over time
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
28. Activity distribution
Second study
• Analysing fine-grained activities, restricted to
source code repository
• Analyse contributors to a project based on
types of activity they contribute to
type of activity file type
coding .c, .h, .cc, .pl, .java, .ada,.cpp, .chh, .py
documenting .html, .txt, .ps, .tex, .sgml, .pdf
translation .po, .pot, .mo, .charset
multimedia .mp3, .ogg, .wav, .au, .mid
...
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
29. Activity distribution
Second study
•Example: Rhythmbox
• overlap between 5 activity types
A = code, B = build, C = devel-doc, D = translation
• Visualised using Venn diagram
values represent A
B
5
C
4 D
number of persons 53
15
6
84
71
involved in a particular 19 51
activity (i.e., having 19
committed at least 16 7
one file of this type) 4
0
1
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
30. Activity distribution
Second study
Evince - scatter plots of correlation
coding versus developer translation versus
coding versus translation
documentation developer documentation
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
31. Activity distribution
Second study
Evince - correlation matrix
Correlations with p-value > 0.01 not shown;
Medium (>0.5) to strong correlations (>0.8) are coloured
Evince document images i18n ui multimedi code meta config build devel.doc other
ation a
documenta 1 0,14715687
tion
images 1 0,16954996 0,20213002 0,23079965 0,41266264 0,4596056 0,4757661 0,44918361 0,19411888
i18n 1 0,15575507 0,13010625 0,17912559 0,15815786 0,30906678
ui 1 0,16142962 0,56632106 0,31157703 0,54728747 0,55080516 0,51066149 0,16779495
multimedia 1 0,21869572 0,41933948 0,33361174 0,36632992 0,20712209
code 1 0,30763128 0,8242121 0,90169181 0,87365497 0,4087696
meta 1 0,31914846 0,54909454 0,2434247 0,61723848
config 1 0,89979553 0,95124778 0,43475402
build 1 0,90540444 0,58183399
devel.doc 1 0,41281866
other 1
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
32. Activity distribution
Second study
• Correlation graph for
Evince docume
ntation
2,2%
•
code config images
Nodes represent relative 46,2% 0,8% 1,2%
amount of activity of meta
1,6%
each type
• Edges represent a build
16,1%
devel-doc
19%
i18n
12,6%
statistically significant
(p<0.01) high correlation
(>0.8) Edges with weak or not statistically significant correlation,
and activity nodes <0,5% are not shown.
Université de Mons IWSECO 2011, Brussels Tom Mens & Mathieu Goeminne
33. Activity distribution
Second study
• Evince: Interpretation of results
• The activities of building, docume
ntation
2,2%
coding and development code
documentation are done by
config images
0,8% 1,2%
46,2%
the same group of persons meta
1,6%
• Translators (i18n) are also
involved in development build devel-doc i18n
documentation but not in coding 16,1% 19% 12,6%
• Other documentation is largely
independent from these activities
Université de Mons IWSECO 2011, Brussels Tom Mens & Mathieu Goeminne
34. Activity distribution
Second study
• Conclusion
• Some activities are done by same group of persons, other
activities are largely independent
• Future work
• Study and compare the activity patterns on other projects
• Study how the separation of activities influences the process
• Compare activity distribution of different types of
distributions
• E.g. translators have a more equal distribution of work
than coders
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
35. Activity distribution
Third study
• Study the activity of persons across software
projects in the GNOME ecosystem
• Case: Large long-lived GNOME projects using Git
ID Project name Authors Years Files
A Banshee 268 5,9 2700
B Rhythmbox 364 8,9 937
C Tomboy 290 6,6 766
D Evince 381 12,1 699
E Brasero 193 4,2 797
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
36. Activity distribution
Third study
• How are activities distributed in different GNOME projects?
• counted in terms of number of files modified
• code files are most frequently committed, followed by
build, development documentation and translation files
'!!!!"
$!!"# "1?=@1-A8)""
,!"# "30,-:"
#6DBE62F=.## &!!!!" "?8""
+!"#
#8512?#
"A39?1-*0)@3*""
*!"#
#D=##
%!!!!" "1-0)""
)!"# #F8>D62/5.E8/##
"81)B-+""
#625.##
(!"#
#=6.G20## "93*CB""
$!!!!"
'!"# #>8/HG## "8#D*""
&!"# #=$+/## "A-7-=EA39""
#F2<2BIF8>## #!!!!"
%!"# "2?8=A""
#7D=BF##
"93A-""
$!"# #>8F2##
!"
!"#
()*+,--" .,/0,1234" 53123/" 678*9-" (:)+-:3" ;,30<-==" >,--+-"
-./0122# 314516789# :86784# ;<=/>2# -?.02?8# @185A2BB# C12202#
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
37. Activity distribution
Third study
• How are activities distributed in different GNOME projects?
• counted in terms of number of persons involved in modifying files
• most persons are involved in translations,
followed by development documentation,
followed by code and build
,!"#
+!"#
A;7<?00#
*!"#
)!"# B?C=?:1.D#
(!"# E.:1.C#
'!"#
&!"#
%!"#
$!"#
!!"#
##
##
##
;##
#
;##
##
##
##
3##
#
7#
0@
/0
0<
4/
.-
+7
9
#2
0=
78
3
>.
0/
=?
23
;9
6/
#-.
#3$
#:
#1
#-.
#.
=;
>:
04
#3 :
07
05
24
2:
#/
#:
.-
#/
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
38. Activity distribution
Third study
• How many persons are contributing to
more than 1 GNOME project?
B
162
C
A 22 9
18
8 4
7 77
11 21 4
127
4 18
15 70 24
4
6
1
1
6
1 1
4 5
7 11
12
148
26
E D
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
39. Activity distribution
Third study
• Which types of activities are carried out by persons involved in multiple
projects?
Mainly translators. Coders tend to stick to one project
B B
106 66
C C
A 9 0 A 9 9
0
1 0 5 2 17
0 46 6 32
0 1 0 5
96 37 10 21
0 0 3 17
10 0 2 1 69 24
0 4
2 6
0 1
1 1
3 2
0 1 1 0
0 0 4 5
5 5 3 7
2 6
107 49
6 23
E D D
E
coders translators
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
40. Activity distribution
Third study
• Work in progress
• Study activities of developers across
different projects
• How, when and why do developers
contribute to different projects?
• Simultaneously
• Over time
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne
41. Thank you
Université de Mons Tuesday 7 June 2011, IWSECO workshop, Brussels Tom Mens & Mathieu Goeminne