This Tuesday, I have the good fortune to give a presentation on N. Smith, A. Capiluppi, and J. Fernandez-Ramil’s classic journal paper “Agent-Based Simulation Of Open Source Evolution,” from Software Process: Improvement and Practice 2006; 11: 423-43. Well, if anything from 2006 can be a classic, F/OSS is the place.

Figuring out how Free Software evolves is a black art. There’s quite a bit of grant money in it, and I’ve seen theories that do everything from quantifying the exact number of developers a project’s core must have to building a checklist of the features that determine whether you will succeed at integrating Open Source into your organization.

In this case, Smith et al. have taken the CVS logs from the Gaim, Wine, Arla, and MPlayer projects, plotted how their complexity evolved over time, then tweaked a model of developer-agents until the virtual project’s complexity had the same shape as the real ones. They hope to use this to uncover causal relationships between module fitness, complexity, and other factors. You will have to make your own decision as to whether they succeeded.

For example, here’s the rate of growth of the four projects, relative to their current size:
3a.jpg

They also make charts counting the number of highly complex functions, defined by having a score of more than 15 on the McCabe index.

4b.jpg
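To make the "highly complex function" count concrete, here is a minimal sketch of the tally. The function names and scores are invented for illustration; in practice the scores would come from whatever complexity tool is run on each snapshot, and only the cutoff of 15 is taken from the text above.

```python
# Cutoff quoted in the post: a function is "highly complex" above 15.
MCCABE_THRESHOLD = 15


def count_highly_complex(scores, threshold=MCCABE_THRESHOLD):
    """Return how many functions score strictly above the threshold."""
    return sum(1 for score in scores.values() if score > threshold)


# Hypothetical per-function McCabe scores for one snapshot of a project.
example_scores = {
    "parse_config": 22,
    "main": 7,
    "render": 16,
    "log": 1,
    "dispatch": 15,  # exactly at the cutoff, so not counted
}

print(count_highly_complex(example_scores))  # 2
```

Repeating this count for each snapshot in the version history would give one point per date on a chart like the one above.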

And the “Complexity Control”, or how well the program is factored, of each of the 4 projects:

5ab.jpg 5cd.jpg

And finally, the distribution of how many changes have been made to the modules in a project. We would expect to see a small number of modules, such as libraries, that get changed many times, and a large number of modules that receive only a few touches (i.e., a long-tailed distribution). That is just what they give us:

6.jpg

Model
The authors decided to try running simulated developers on a simulated generic project using the NetLogo package. Each developer is an agent and is centered on a module. Each module stores two numbers: its complexity and its fitness. At any time, a developer is doing one of four things:

  • Creating a new module: If the developer is on an unfilled requirement, it makes a new module.
  • Refactoring: If the complexity of the module is too high, the agent refactors it. This reduces the complexity by a random amount.
  • Developing: If the module is not being refactored, the agent will develop it. This increases the complexity and fitness by random amounts. (I laughed out loud when I read that development increases fitness by a random amount. How appropriate.)
  • Leaving: Each developer has a boredom threshold. If the fitness of the module is too high, the developer will get bored and leave because the module is “too good” for further interesting work.

“Finally, modules have a chance to capture the attention of a developer passing through cyberspace.”
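The four behaviours can be sketched in a few lines. This is Python rather than NetLogo, it follows a single developer on a single module slot, and it is not the authors' code: the thresholds, the step sizes, and the order in which the rules are checked are all my assumptions, and the attention-capturing step is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class Module:
    complexity: float = 0.0
    fitness: float = 0.0


# Invented values for illustration; the paper's parameters are not given here.
COMPLEXITY_LIMIT = 10.0   # above this, the agent refactors
BOREDOM_THRESHOLD = 8.0   # above this fitness, the agent leaves


def step(module, rng):
    """Apply one developer action; return the action name and the module."""
    if module is None:
        # Unfilled requirement: create a new module.
        return "create", Module()
    if module.fitness > BOREDOM_THRESHOLD:
        # Module is "too good": the developer gets bored and leaves.
        return "leave", module
    if module.complexity > COMPLEXITY_LIMIT:
        # Refactoring reduces complexity by a random amount.
        module.complexity -= rng.uniform(0.0, module.complexity)
        return "refactor", module
    # Developing raises complexity and fitness by random amounts.
    module.complexity += rng.uniform(0.0, 2.0)
    module.fitness += rng.uniform(0.0, 1.0)
    return "develop", module


def simulate(ticks, seed=0):
    """Run one developer until it leaves or the ticks run out."""
    rng = random.Random(seed)
    module = None
    history = []
    for _ in range(ticks):
        action, module = step(module, rng)
        history.append(action)
        if action == "leave":
            break
    return history, module


history, module = simulate(200, seed=1)
print(history[:5], round(module.complexity, 2), round(module.fitness, 2))
```

Even this toy version shows the basic tension in the model: development pushes complexity up until refactoring knocks it back down, while fitness only ever climbs until the developer walks away.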

The model sometimes has problems converging to plausible results, so the authors explore 256 combinations of parameters, throw away the bad ones, and choose one of the rest, which they say are all roughly similar.
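A sweep like that might look roughly like the sketch below. The parameter names, their values, and the plausibility test are all hypothetical; I picked four parameters with four values each only because that happens to give 256 combinations, not because that is how the authors arrived at the number.

```python
from itertools import product

# Hypothetical grid: four parameters, four values each, 4**4 = 256 combinations.
GRID = {
    "complexity_limit": [5, 10, 15, 20],
    "boredom_threshold": [4, 8, 12, 16],
    "num_developers": [5, 10, 20, 40],
    "attention_chance": [0.1, 0.2, 0.4, 0.8],
}


def sweep(grid, run, is_plausible):
    """Run every parameter combination and keep those judged plausible."""
    names = list(grid)
    kept = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        if is_plausible(run(params)):
            kept.append(params)
    return kept


def toy_run(params):
    """Stand-in for a real simulation run; returns a made-up summary number."""
    return params["complexity_limit"] * params["num_developers"]


def toy_is_plausible(result):
    """Stand-in filter; a real one would compare against the empirical curves."""
    return 50 <= result <= 400


survivors = sweep(GRID, toy_run, toy_is_plausible)
print(len(survivors), "of", 4 ** 4, "combinations kept")
```

The interesting design question is hidden inside `is_plausible`: how strict that filter is decides how much of the final match comes from the model and how much from the selection.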

Results
How does the simulation match up? I’ll put the empirical results side-by-side with the simulation results.

Growth Rates
The authors claim success on this one. You’ll recall the actual project growth rates:
3a.jpg
and here is the simulated growth rate:
7a.jpg

Complexity
The simulated result is the lower line in the right graph. The authors claim this line is “very similar to the empirical observations” (both lines in the left graph).
4b.jpg 7b.jpg

Complexity Control
The original pattern for all four projects:
5ab.jpg 5cd.jpg
and the simulation, which the authors write “is able to reproduce this pattern”:
8a.jpg

Distribution of Changes
Finally, the results for how often modules get changed.
Empirical:
6.jpg
and simulated:
8b.jpg

I’m not sure about you, but I consider the question still open.