What's the distribution of software size?

The statistical distribution of software size can be used to estimate, a priori, the size of a project and the effort it will take. The research literature has reported that software size follows a lognormal distribution. However, we have found that the lognormal distribution systematically underestimates the size of software, and that the double Pareto distribution seems to fit the size of source code files better.
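To make the comparison concrete, here is a minimal Python sketch of how one might check whether a lognormal fit underestimates the share of large files. It is not the method from the paper: the function names and the sample values are made up for illustration, and it only covers the lognormal side (fitting a double Pareto is left out). The idea is to fit the lognormal by taking the mean and standard deviation of the log sizes, and then compare its complementary cumulative distribution function (CCDF) with the empirical one at a few thresholds.

```python
import math
from statistics import NormalDist, fmean, pstdev


def fit_lognormal(sizes):
    """Maximum-likelihood lognormal fit: mean and std of log(size)."""
    logs = [math.log(s) for s in sizes]
    return fmean(logs), pstdev(logs)


def lognormal_ccdf(x, mu, sigma):
    """P(X > x) under a lognormal with parameters mu and sigma."""
    return 1.0 - NormalDist(mu, sigma).cdf(math.log(x))


def empirical_ccdf(sizes, x):
    """Fraction of files strictly larger than x."""
    return sum(1 for s in sizes if s > x) / len(sizes)


# Hypothetical SLOC counts, for illustration only (not data from the paper).
sizes = [12, 30, 45, 60, 80, 120, 150, 200, 400, 5000]

mu, sigma = fit_lognormal(sizes)
for threshold in (100, 1000):
    print(
        threshold,
        empirical_ccdf(sizes, threshold),
        lognormal_ccdf(threshold, mu, sigma),
    )
```

With real data, if the fitted CCDF falls below the empirical CCDF for large thresholds, the lognormal is assigning too little probability to large files, which is the kind of underestimation described above.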

If you want to know more, read our ICSOFT 2011 paper. Before reading it, you may want to look at the slides from the conference presentation, or at the tag cloud generated from the paper text:

ccdf code data distribution double estimation figure files function java languages law lines lisp lognormal models parameters pareto power programming python sample shell size sloc software source statistical threshold values
created at TagCrowd.com

To cite this paper, there is a BibTeX file available, or you can copy the entry below:

  author    =    {Israel Herraiz 
                  and Daniel German 
                  and Ahmed E. Hassan},
  title     =    {On the distribution of source code file sizes},
  booktitle =    {International Conference 
                  on Software and Data Technologies},
  year      =    {2011},
  address   =    {Seville, Spain},
Written on Jul 20 2011 | Tags: #research, #icsoft, #msr, #paper
