herraiz.org | PhD


PhD thesis

On October 24th 2008, I obtained my PhD degree with European mention at Universidad Rey Juan Carlos, after the defense of a thesis entitled A statistical examination of the evolution and properties of libre software.

The committee's members were:

I wrote this thesis and conducted my research work under the supervision (I would rather say mentoring) of Gregorio Robles and Jesús M. González-Barahona.

Most of my work was funded by the Comunidad de Madrid (the regional government of Madrid) and the European Social Fund, under grant number 01/FPI/0582/2005.


A statistical examination of the evolution and properties of libre software (PDF, 7.9 MB)

Download the BibTeX citation. Please cite my thesis if it is relevant to your work.


How and why does software evolve? This question has been under study for almost 40 years, and it is still a subject of controversy. After many years of empirical research, Meir M. Lehman formulated the laws of software evolution, which were a first attempt to characterize the dynamics of the evolution of software systems.

With the rise of the libre (free / open source) software development phenomenon, some cases that do not fulfill those laws have appeared. Are Lehman's laws valid in the case of libre software development? Is it possible to design a universal theory of software evolution? And if so, how?

This thesis is a large-scale empirical study that uses a statistical approach to analyze the properties and evolution of libre software. The properties studied are size and complexity. For the study of properties, we used a set of thousands of software systems, extracted using the package system of FreeBSD. The evolution study was done using another set of thousands of software projects hosted on SourceForge.net.

With the first set, we measured different size and complexity metrics of the source code of the packages in FreeBSD, and calculated the correlations among the different metrics. We also estimated the distribution function of those properties.
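
As a rough illustration of this kind of correlation analysis, the following minimal sketch computes a Spearman rank correlation between two metrics. It is not the code used in the thesis: the choice of Spearman's coefficient, the function names, and the sample values are all assumptions made for illustration. A rank-based coefficient is used here only because size metrics tend to be heavily skewed.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical per-file measurements (illustrative only, not thesis data).
sloc = [120, 45, 300, 80, 1500, 610, 30, 220]
cyclomatic = [14, 5, 41, 9, 180, 77, 3, 25]

print(spearman(sloc, cyclomatic))
```

A coefficient close to 1 for a pair of metrics would indicate that one of them adds little information once the other is known.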

Regarding the second set, we obtained the daily time series of the number of changes for each project. We applied Time Series Analysis to estimate the kind of process that drives software evolution. We used ARIMA (AutoRegressive Integrated Moving Average) models to forecast evolution.
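
To give a flavor of this kind of analysis, here is a minimal sketch that computes the sample autocorrelation of a daily series and forecasts with an AR(1) model, which is the simplest special case of ARIMA, namely ARIMA(1,0,0). The model order, the Yule-Walker fit, the function names, and the sample data are assumptions chosen for illustration; the model orders actually used in the thesis are not stated here, and a real ARIMA fit would normally rely on a statistics package.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return num / denom


def ar1_forecast(series, steps):
    """Forecast `steps` values ahead with an AR(1) model.

    The coefficient is the lag-1 sample autocorrelation, which is the
    Yule-Walker estimate for an AR(1) process.
    """
    mean = sum(series) / len(series)
    phi = autocorrelation(series, 1)
    forecasts = []
    last = series[-1]
    for _ in range(steps):
        # Each forecast is pulled back towards the mean by a factor phi.
        last = mean + phi * (last - mean)
        forecasts.append(last)
    return forecasts


# Hypothetical daily change counts (illustrative only, not thesis data).
daily_changes = [3, 5, 4, 8, 7, 6, 9, 4, 2, 3, 5, 6, 8, 7, 5, 4]

# Autocorrelation at lags 1 to 5, a common first diagnostic of the process.
print([autocorrelation(daily_changes, k) for k in range(1, 6)])
print(ar1_forecast(daily_changes, 3))
```

In this simplified model the forecasts decay geometrically towards the series mean, so the further ahead the forecast, the less the most recent observation matters.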

The results show that a small subset of basic size metrics is enough to characterize a software system. Furthermore, the shape of the distribution of those metrics suggests that the Random Forest File Model could be used to simulate the evolution of a software product.

Using Time Series Analysis (TSA), we have found that software evolution is a short-memory process. That implies that statistical models of evolution based on TSA are a better option than regression models for forecasting purposes.

Finally, the shape of the distribution of size is the same, regardless of the level of aggregation used to measure it (file, module, software project, etc.). That is evidence of self-similarity in software, and could be an explanation of the fast growth patterns observed in some libre software projects.

Another remarkable contribution of this thesis is that it shows how to perform an empirical study at a large scale, using publicly available data sources. Thanks to this, all the results are repeatable and verifiable by third parties. Therefore, the conclusions of this thesis can be the beginning of a theory of software evolution that is based on empirical findings verified in thousands of software systems.

Presentation video

Thanks to the invaluable help of Marcos Lara (member of the Libresoft research group), there is a video available of my presentation.

You can watch it online here, or on Vimeo.

Author: Israel Herraiz

Created: 2016-07-04 Mon 16:08
