Rich Sands, Director of Developer Communities at Black Duck, presented these interesting statistics on open source projects from Ohloh.net at the 2012 Linux Foundation Collaboration Summit.
Open Source By The Numbers
1. Open Source By The Numbers
Rich Sands
Director of Developer Communities
Black Duck Software, Inc.
2. How Big is FOSS?
• GitHub: 4,751,000 repositories
• SourceForge: 324,000 projects
• Ohloh: 550,000 projects
BIG
3. No, REALLY, How Big is FOSS?
• It depends on how you count.
• Lots of projects, but
– How many are active, how many abandoned?
– How many have a team?
A better question to ask: How much FOSS is actually being worked on now?
4. How Many Projects are Active?
• 550,000+ projects on Ohloh.
• 271,372 with a code analysis.
• 96,824 with a commit in the past 2 years.
• 46,883 with a commit in the past year.
• 29,303 with a commit in the past 6 months.
• 21,251 with a commit in the past 3 months.
• 12,870 with a commit in the past month.
• 5,629 with a commit in the past week.
• 1,224 with a commit in the past day (3/30-3/31, a weekend)
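The counts on this slide can be turned into shares of the analyzed projects with a few lines of Python. This is only a sketch of the arithmetic: the counts are copied from the slide, the names `share` and `funnel` are made up for illustration, and because the total is given as "550,000+" any share of the total is approximate.

```python
from typing import Dict

# Counts copied from the slide. The total is stated as "550,000+",
# so percentages of the total are approximate.
TOTAL_ON_OHLOH = 550_000
ANALYZED = 271_372

# Projects with at least one commit inside each time window.
COMMIT_WITHIN = {
    "2 years": 96_824,
    "1 year": 46_883,
    "6 months": 29_303,
    "3 months": 21_251,
    "1 month": 12_870,
    "1 week": 5_629,
    "1 day": 1_224,
}


def share(count: int, base: int) -> float:
    """Return count as a percentage of base, rounded to one decimal place."""
    return round(100.0 * count / base, 1)


def funnel(base: int = ANALYZED) -> Dict[str, float]:
    """Percentage of `base` projects with a commit inside each window."""
    return {window: share(count, base) for window, count in COMMIT_WITHIN.items()}


if __name__ == "__main__":
    print(f"analyzed / total: {share(ANALYZED, TOTAL_ON_OHLOH)}%")
    for window, pct in funnel().items():
        print(f"commit within {window:>8}: {pct:5.1f}% of analyzed projects")
```

Using the analyzed projects as the base matches the 17.3% figure quoted for the one-year window.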
5. How Many Projects Are Active?
[Chart: days since last commit (vertical axis, up to 6000) plotted against the % of analyzed projects with a commit in the last Y days (horizontal axis, 100 down to 10). Marked point: 17.3% of analyzed projects at 1 year.]
6. But Do All These Projects Have a Team?
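[Chart: number of committers (vertical axis, ticks from 2 to 50, top value 2827) plotted against the % of active projects with at least Y committers (horizontal axis, 100 down to 10). Marked point: 49.3% of active projects have 2 or more committers, which is 8.5% of all analyzed projects.]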
7. What is a “Live” Project, Anyway?
• Let’s invent a new metric – “Liveness”:
– At least one commit in the last year, and at least 2 committers for liveness to be non-zero.
– Time-weighted roll-up of activity, where older activity counts less than more recent activity.
– For this presentation, activity is committer count.
– Exponential time-weighting decay such that the most recent month’s activity counts fully, and activity 11 months back counts nearly zero.
– Normalized; liveness of the Linux Kernel = 1000.
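The definition above could be sketched roughly as follows. This is not the formula used for the presentation, which the slides do not give. It assumes monthly buckets of committer counts, a plain `exp(-DECAY * age)` weight, and that "nearly zero" means a weight of 0.01 at 11 months back. The names `raw_liveness` and `normalize` and the sample counts are hypothetical.

```python
import math
from typing import Dict, Sequence

MONTHS = 12
# Assumption: "nearly zero" is read as a weight of 0.01 at 11 months back.
FLOOR_WEIGHT = 0.01
# Decay rate chosen so that weight(0) == 1.0 and weight(11) == FLOOR_WEIGHT.
DECAY = -math.log(FLOOR_WEIGHT) / (MONTHS - 1)


def raw_liveness(monthly_committers: Sequence[int], all_time_committers: int) -> float:
    """Time-weighted committer count before normalization.

    monthly_committers[0] is the most recent month and
    monthly_committers[11] is 11 months back.
    """
    if len(monthly_committers) != MONTHS:
        raise ValueError("expected one committer count per month for 12 months")
    # Hurdles: some activity in the last year and at least 2 committers ever.
    if all_time_committers < 2 or sum(monthly_committers) == 0:
        return 0.0
    return sum(
        count * math.exp(-DECAY * age)
        for age, count in enumerate(monthly_committers)
    )


def normalize(raw_scores: Dict[str, float], reference: str) -> Dict[str, float]:
    """Rescale raw scores so that the reference project scores 1000."""
    ref = raw_scores[reference]
    if ref <= 0:
        raise ValueError("reference project must have a positive raw score")
    return {name: 1000.0 * score / ref for name, score in raw_scores.items()}


if __name__ == "__main__":
    # Made-up committer counts for illustration, not Ohloh data.
    raw = {
        "reference": raw_liveness([40] * MONTHS, all_time_committers=300),
        "small-team": raw_liveness([4] * MONTHS, all_time_committers=10),
        "solo": raw_liveness([1] * MONTHS, all_time_committers=1),
    }
    for name, score in normalize(raw, "reference").items():
        print(f"{name}: {score:.2f}")
```

Swapping committer count for commit count or LOC deltas would only change what goes into `monthly_committers`; the weighting and normalization stay the same.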
8. Sniff Test – What Are the Top 50 Live Projects?
Liveness  Project
1000.00   Linux Kernel
 711.40   Chromium (Google Chrome)
 516.68   KDE
 491.68   Mozilla Firefox
 491.37   Mozilla Core
 473.17   Boot To Gecko
 396.51   GNOME
 322.54   Homebrew
 319.47   Gentoo Linux
 300.32   WebKit
 273.38   Qt 5
 226.36   FreeBSD Ports
 194.87   OpenStack
 163.50   docrails
 163.19   Ruby on Rails
 159.54   Android
 155.82   LibreOffice
 154.54   MediaWiki
 146.83   FreeBSD
 145.55   GNU Compiler Collection
 129.94   FFmpeg
 124.62   OpenERP
 123.33   SBo-git
 123.33   SlackBuilds.org
 118.32   OpenStack Nova
 118.32   openstack's nova
 117.52   The LLVM Compiler Infrastructure
 115.64   llvm-mirror
 115.20   NetBeans IDE
 114.01   JBoss Application Server
 113.96   NetBSD
 112.73   JBossAS7
 112.73   JBoss Application Server 7
 109.69   Jenkins
 109.26   U-Boot
 108.83   Go programming language
 107.60   tav's go
 105.34   QEMU
 103.62   pkgsrc: The NetBSD Packages Collection
 101.50   platform_frameworks_base
 101.39   Trinity Core
 100.86   LLVM C/Objective-C/C++ frontend (old)
 100.86   LLVM/Clang C family frontend
 100.16   Symfony
  95.76   WSO2 Business Process Server
  94.23   Intellij Community
  90.95   Wine
  89.46   Qt 4
  89.01   XBMC Media Center
  88.39   Chromium Tools (Google Chrome)
Note – there are a few duplicates and mirrors in this list
9. How Big Are Live Projects?
[Scatter plot: top 5000 live projects. Horizontal axis: liveness (0-1000 scale). Vertical axis: lines of code (up to 60M).]
• Callouts: “No really big, really active projects”; “For most projects, bigger means less active”.
• Labeled “Distros”: Aptosid (Debian distro).
• Labeled “Famous” projects and other named points: Android Platform Frameworks Base, Linux Kernel, Android, KDE, LibreOffice, FreeBSD, Firefox, GCC, GNOME, Chromium, MySQL, WebKit, Git, Qt, Ruby on Rails.
10. How Does Size Relate to Committer Count?
[Scatter plot: top 5000 live projects. Horizontal axis: 1-year committer count (0 to 3000). Vertical axis: lines of code (up to 60M).]
• Callout: “Similar effect – larger means fewer committers”.
• Named points: Linux Kernel, Android, KDE, LibreOffice, Firefox, GNOME, GCC, Chromium, Qt, Ruby on Rails.
11. Languages of Live Projects
[Pie chart: primary language of the top 5000 live projects. Segments: Perl, C#, Ruby, Java, PHP, JavaScript, C, Python, C++, Other.]
12. Average Liveness By Language
[Bar chart: average liveness by primary language for the top 5000 live projects (horizontal axis: liveness, 0 to 18). Bars, top to bottom: C#, JavaScript, Perl, Python, Java, PHP, Ruby, C, C++.]
13. Average Project Size By Language
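[Bar chart: average project size by primary language for the top 5000 live projects (horizontal axis: millions of lines of code, 0 to 8). Bars, top to bottom: Ruby, Python, Perl, C#, JavaScript, PHP, Java, C, C++.]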
14. Language vs. Number of Committers
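[Bar chart, log scale: total committers (vertical axis, 1,000 to 100,000) by primary language for the top 5000 live projects. Languages, left to right: Java, C, C++, Python, JavaScript, PHP, Ruby, C#, Perl. Series: all-time committers, 1-year committers, 30-day committers.]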
15. Languages of New Projects – Then and Now
[Bar chart: % of new projects by primary language (vertical axis, 0% to 30%). Languages, left to right: Java, Other, C++, C, Python, PHP, JavaScript, C#, Ruby, Perl. Series: started 5 years ago, started in the past year.]
16. The 8 Most Live New Projects in the Past Year
• oVirt-engine (Liveness: 50.5, 457K LOC, Java)
– Description: KVM Management System.
– Why so active? Open governance, backed by Cisco, Red Hat, IBM, Canonical, Intel, ... and aims at a burning problem.
• WebRTC (Liveness: 47.3, 407K LOC, C++)
– Description: Implements W3C RFC for streaming media JavaScript API.
– Why so active? Supported by Google (Chrome), Mozilla, and Opera, a core HTML5 streaming media std.
• Khan Academy Exercises (Liveness: 44.3, 90K LOC, JavaScript)
– Description: Crowdsourced exercises for a self-service educational platform.
– Why so active? Very easy to contribute, taps altruistic impulses of educators worldwide.
• Twitter Bootstrap (Liveness: 40.0, 41K LOC, JavaScript)
– Description: CSS, HTML, JavaScript toolkit for rapid webapp development.
– Why so active? Heavily promoted by Twitter, high-quality, aims at a burning problem.
• Wikimedia Puppet (Liveness: 33.8, 37K LOC, Puppet)
– Description: Wikimedia’s Puppet configuration.
– Why so active? Exemplary Puppet implementation for a very heavily trafficked site.
• Katello (Liveness: 31.6, 137K LOC, Ruby)
– Description: RHEL server system lifecycle mgmt.
– Why so active? Announced at Red Hat 2011 Summit, follow-on to Satellite project.
• Cloud Foundry (Liveness: 30.7, 29K LOC, Ruby)
– Description: VMWare’s PaaS platform.
– Why so active? Substantial industry support, marketing. Aims at a burning problem.
• Composer (Liveness: 30.3, 14K LOC, PHP)
– Description: Package manager for PHP.
– Why so active? Aims at a burning problem for PHP.
17. Open Source by the Numbers
• Only a small fraction of all the projects ever started gain long-term traction.
• Less than 5% of all projects on Ohloh are “live”: a commit in the past year, and more than 1 committer, ever.
• The larger the code base, the fewer contributors and the less activity.
• “Famous” projects are mostly Java and C-family, and these older, established languages retain their dominant mindshare.
• New live projects trending towards Python, PHP, JavaScript and away from C-family languages.
• The “most likely to succeed” new projects:
– Have big backers and marketing behind them.
– Are still small enough for people joining them to have an impact.
18. Questions?
Ohloh.net
Your guide to open source
Join the Ohloh community and gain critical insights into the world of open source projects
Editor's Notes
Introduce self. Explain role. Use Ohloh data to tease out some interesting facts about FOSS.
How do we even answer this question? Is it about:
• The number of repositories?
• The number of projects?
• How much code is under an approved license?
• The number of developers contributing?
• The number of commits?
We know it is BIG, but what does that mean?
Most of the size estimates don’t span multiple forges. Not all the repositories on a site like GitHub are part of FOSS projects. There aren’t any “complete” directories of FOSS projects. But that doesn’t really matter.
For the purposes of this presentation we’re going to look at the data in Ohloh. It spans multiple forges and includes projects that host their own code too. But it doesn’t include everything. Still, it is a sizeable fraction of everything and a representative cross-section of projects.
Those huge numbers of projects and repositories miss the point. The FOSS that matters is the FOSS that has activity and community. So we’ll focus on that subset for this presentation.
So how many projects are active? A little under ½ (271,372 of 550,000+) have a working repository with code. Only 35% of those have had activity in the past 2 years.
Here’s how the curve looks for projects with a code analysis. The vertical axis shows how long it has been since the last commit. The horizontal axis shows the percentage of projects that have had at least one commit within a particular timeframe.
Let’s pick a nice, arbitrary but reasonable definition of “active”: at least one commit in the past year. About 17.3% of all the projects with an analysis (46,883) have had a commit within the past year. The rest of the projects can be considered “abandoned”. FOSS plants a lot of seeds but only a small percentage take root.
So does that mean that we need to pay attention to about 47,000 projects? Not quite.
There is more to a project than commits. FOSS is collaborative. It is the product of a community.
Let’s set a bar for “community”. A pretty low bar. We’ll set it at 2. How many of the active projects have had more than one committer, ever? A bit less than ½ of the active projects have ever had more than one committer. The rest are someone’s private thing.
So only 8.5% of all the analyzed projects have had a commit in the past year and have a team of at least 2 people working on them. That is a little over 23,000 projects.
Let’s declare these projects to be “live” FOSS projects.
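As a rough cross-check, the number of "live" projects and their share of the analyzed and total project counts can be recomputed from the slide figures. The sketch below assumes the 49.3% "2 or more committers" share applies to the 46,883 active projects and treats "550,000+" as 550,000. The function name `live_projects` is made up for illustration.

```python
# Figures taken from slides 4 and 6. TOTAL is stated as "550,000+",
# so the share of the total is approximate.
ACTIVE = 46_883      # analyzed projects with a commit in the past year
TEAM_SHARE = 0.493   # share of active projects with 2 or more committers
ANALYZED = 271_372   # projects with a code analysis
TOTAL = 550_000      # projects on Ohloh


def live_projects(active: int = ACTIVE, team_share: float = TEAM_SHARE) -> int:
    """Estimate the number of projects that are both active and team-backed."""
    return round(active * team_share)


live = live_projects()
pct_of_analyzed = round(100.0 * live / ANALYZED, 1)
pct_of_total = round(100.0 * live / TOTAL, 1)

if __name__ == "__main__":
    print(f"live projects: {live}")
    print(f"share of analyzed projects: {pct_of_analyzed}%")
    print(f"share of all Ohloh projects: {pct_of_total}%")
```

The result lines up with the 8.5% of analyzed projects shown on slide 6 and with the "less than 5%" of all Ohloh projects in the summary.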
So how do we measure liveness? Can we come up with a score that:
• Puts projects onto a scoring continuum in sensible relationships
• Spreads out values enough so that well-known projects don't just bunch up at the high end of the scale
• Still gives smaller projects a meaningful “spread”
Start with the basic definition of “Liveness” as we’ve seen so far. Projects that don’t clear the under-a-year and team hurdles get a score of 0.
Now make more recent activity count more than older activity. For the analysis in this presentation we used committer count. We could use commit count, or LOC deltas, or any number of other approaches combining these; we need to experiment more to see what works best.
Exponential decay helps spread out the projects and makes the most active ones, with both large and active teams, really stand out.
One project in particular has a much higher “Liveness” using this method: the Linux Kernel. That makes sense, so let’s make the Kernel = 1000 and normalize everything else to that.
Using this method, here is a Liveness list for the top 50. We see the Kernel at the top, and a number of famous projects with a lot of activity (Chrome, KDE, Firefox, GNOME, OpenStack, Android, …) represented at the top of this list. There are some duplicate and mirror repos here as well. Boot To Gecko, a project that recently got a lot of buzz at Mobile World Congress as a new approach to mobile OS design, also shows up in the top 10.
Further experimentation is needed to come up with a really solid “liveness” metric, but even this basic approach seems to hold promise.
Let’s start comparing projects on different dimensions using this liveness score. How big are the most live projects?
Here is a scatter plot of the top 5000 live projects. What are we looking at?
We can see that “famous” projects, those with large, active teams and lots of code activity, spread out from the dense cluster of smaller, less well-known but still live efforts. A few projects with enormous code bases show up; these are “distros” that aggregate a lot of other projects and code. There are no really HUGE, active projects. In fact, the larger the project, the less “live” it is. There are always exceptions, like the Linux Kernel, which “prove the rule”.
One of the dimensions of liveness is committer count, and we see a similar effect. Bigger projects have fewer committers. Anyone want to speculate as to why that might be?
• Harder to understand a really big code base, so fewer devs are able to dive in?
• Smaller projects are in more popular languages?
Speaking of languages, here is a breakdown of the primary language for the top 5000 live projects. As Stephen O’Grady discussed at FOSDEM in his presentation “Java in the Age of the JVM”, Java is [still] not dead. C-family languages are also very heavily used: Java + C + C++ is about half of all the actively developed projects. The rest are the hot dynamic languages, with JavaScript and Python the big ones. “Other”, which is everything else (Scala, Groovy, Clojure, Haskell, whatever), totals about the same as Python.
So which languages are the primary languages of the most live projects? Here is the average “liveness” of the projects by their primary language. The C-family languages are heavily represented in the largest, most active projects. Why? Perhaps because these infrastructure and OS projects are “system software” with a long history. Once we account for that history, we see more recent and popular languages duking it out with Java.
We saw that project size (LOC) is inversely correlated with liveness. Java is used in a number of very large projects as well. Could that account for its placement on the previous chart behind Ruby and PHP? Is the small typical project size of Ruby and Python a factor in these languages’ popularity?
Which languages attract the largest numbers of developers to live projects? Note the LOG scale here; otherwise the red and green bars would be too short to compare. No surprise here: Java, C, and C++ projects get the largest number of committers, because these well-established languages are very well known by a large number of developers. We don’t see any unusual outliers when we look at committers in the last year, or in the last month.
But things start getting more interesting when we look at the primary language of NEW projects. (Note: this is for all projects, not just live ones; a lot of the ones started 5 years ago have since been abandoned.) In fact, let’s look at the primary languages of projects STARTED during 2006-2007, 5 years ago, vs. projects STARTED in the past year. We can see that when developers are cranking up something new, they’re experimenting with new languages (Other) and adopting Python, PHP, and JavaScript significantly more than they did five years ago.
Is there anything we can learn by looking at this past year’s new projects, and looking at the most live ones? A few things stand out:
• Most of the new, most-live projects are backed by major corporations, often as part of a consortium (oVirt, Twitter Bootstrap, Wikimedia Puppet, Katello, Cloud Foundry).
• Projects that build a reference implementation for an industry standard (WebRTC).
• Projects that are dead-simple to contribute to and leverage crowdsourcing (Khan Academy Exercises).
More than anything else, though, new, successful projects (those that have the best chance of emerging from the cluster of live projects to join the “famous” family) aim to solve a burning problem for a large user base. It seems like the corporate support may be a consequence of the burning problem effect.
What can we learn from all of this? This is just a snapshot of the kinds of analysis possible with a big pool of metadata on FOSS. There are many other ways to slice this data, and it doesn’t even really touch on the kinds of contributor or what could be gleaned from a people-centric view, rather than a project-centered view.
Yes, all that code out there is available for the taking and yes, even some abandoned projects are being checked out and used, or inspiring new projects. But only a tiny fraction (about 4.2%) of the total projects in Ohloh are “live”.
By looking at the most live projects from different angles (size, language, etc.) we can start to see some patterns:
• Big code bases get fewer contributors and less activity.
• Famous projects have a long history, and get a huge share of the overall activity in FOSS. This skews language use and adoption towards the languages that were most popular back when these famous projects were young.
• Newer projects are able to adopt newer languages, and do.
• The most important factor in new project success is trying to tackle a really important problem. This leads to big backers, and marketing to drive awareness.
• New projects also gain contributors because they’re small enough that developers can make a real difference. “First mover” contributors on new projects can have a huge impact on how these projects evolve.