herraiz.org | Blog
The impact of bias in bug-fix datasets for defects prediction
Last week I gave a talk at UC Davis about the research I will be doing over the coming months. It contains some preliminary results about the impact of bias in bug-fix datasets.
In projects with bug tracking systems and version control repositories, a commit that fixes a bug is usually marked accordingly (for instance, with a message like "Fixes bug #123"). This information can be used to recover the relation between commits and bugs, which is useful for defect prediction. The preliminary results I have obtained so far show that the impact of bias is negligible for defect prediction if the model is a binary classifier (that is, it only predicts whether an entity will contain defects, not how many defects it will contain). A non-biased dataset can provide better accuracy, but only because, by definition, non-biased datasets contain more data. If we reduce the size of a non-biased dataset by extracting a random sub-sample, it is as good as a biased dataset of the same size. At least, that holds for the two cases I have studied so far.
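As a rough illustration of how such commit messages can be turned into commit-to-bug links, here is a minimal Python sketch. It is not the tooling behind the talk: the regular expression, the function name `link_commits_to_bugs`, and the sample commit log are all assumptions made for this example, and real projects follow many other conventions.

```python
import re

# Assumed pattern: matches phrases such as "Fixes bug #123" or "fixed #45".
# Real projects use many other conventions, so this is illustrative only.
BUG_REF = re.compile(r"\bfix(?:es|ed)?\s+(?:bug\s+)?#(\d+)", re.IGNORECASE)


def link_commits_to_bugs(commits):
    """Map each commit id to the sorted list of bug ids named in its message."""
    links = {}
    for commit_id, message in commits:
        bug_ids = sorted({int(number) for number in BUG_REF.findall(message)})
        if bug_ids:
            links[commit_id] = bug_ids
    return links


# Hypothetical commit log: (commit id, commit message) pairs.
commits = [
    ("a1b2c3", "Fixes bug #123: crash on empty input"),
    ("d4e5f6", "Refactor the parser"),
    ("0718ab", "fixed #45 and fixes bug #46"),
]

for commit_id, bug_ids in link_commits_to_bugs(commits).items():
    print(commit_id, bug_ids)
```

A commit that fixes a bug but does not mention it in its message is silently left out of the result, which is exactly how this kind of dataset becomes biased.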
More details in the slides. You can also see the slides at Slideshare.net, and get a PDF copy.
Written on Apr 15 2012
Visiting UC Davis
I have been in Davis, California, for a couple of days now, on a four-month visit to UC Davis hosted by Prof. Prem Devanbu. This visit is possible thanks to a "José Castillejo" grant awarded by the Spanish Ministry of Education and Science.
The main goal of this visit is to work on finding an automated method to evaluate the bias in bug datasets. This bias is introduced when bug-fix reports are linked with commits in the version control system. When a developer accepts and/or fixes a bug report, she decides on a severity level and marks the report accordingly. In Bugzilla, one of the most widely used bug tracking systems, a developer can mark severity on a seven-level scale. In a previous paper (PDF available), I showed that not all developers use the same criteria to select the severity, and that three levels should be enough. This difference in the developers' criteria for marking and classifying bug reports is one of the sources of bias in bug-fix datasets (PDF of the paper available). Another source of bias is developer confidence: not all developers mark commits or bug reports with commit ids when they are new to a project, because they are afraid of exposing themselves. However, those commits do correspond to bug fixes, and should be accounted for in a bug-fix dataset.
This bias affects the Eclipse Bug Data from the Software Engineering Chair at Saarland University, one of the main data sources used in empirical software engineering. As an example, a paper studying the distribution of software bugs based on that Eclipse data prompted a response that found other, better distribution fits, and that did not reuse the same dataset but gathered the data directly from the original sources.
Clearly, reusing datasets for empirical software engineering is a good idea, which fosters reproducibility and verifiability, essential properties of any empirical research discipline. However, if we cannot assure the quality of the reusable datasets, they can do more harm than good.
My goal with this visit is to apply statistical methods to evaluate the bias in a bug-fix dataset. The two papers about the distribution of bugs in Eclipse are an example of the kind of work I want to do. If we can be sure of the quality and lack of bias of a dataset, carefully built to act as a "canonical" dataset, we can compare other datasets against that canonical dataset to find out if there is any bias. The two papers about Eclipse mentioned above show that the distribution of bugs can vary in the presence of bias. The first paper used a biased dataset, and the second paper repeated the data gathering process from scratch, avoiding the use of the biased dataset. They found different distributions for software bugs, although the difference could also be due to methodological differences.
So my goal is to measure this difference in the distribution using a statistical technique, to detect the presence of bias, and to develop a statistical test to find bias in reusable datasets. I am assuming here that the distribution cannot change due to other factors (and we already know that there are other sources of bias in bug reports), and that the shape of the distribution is unique. The second assumption is quite fair, but the first assumption is more complicated, and it will require finding more than one dataset that is known to be unbiased.
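To make the idea of comparing a dataset against a canonical one concrete, here is a minimal sketch using the two-sample Kolmogorov-Smirnov statistic. This particular test is an assumption chosen for illustration, not necessarily the technique this work will end up using; the function names and the sample data are hypothetical. Note also that the large-sample critical value used here assumes continuous data, so for discrete bug counts it is only approximate.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    largest_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x (this handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        largest_gap = max(largest_gap, abs(i / len(a) - j / len(b)))
    return largest_gap


def ks_rejects(sample_a, sample_b, alpha=0.05):
    """True if the samples differ at level alpha (large-sample approximation)."""
    n, m = len(sample_a), len(sample_b)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) > critical


# Hypothetical bugs-per-file counts for a canonical and a candidate dataset.
canonical = [0, 0, 1, 1, 2, 3, 5, 8, 13, 21]
candidate = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3]

print(ks_statistic(canonical, candidate))
print(ks_rejects(canonical, candidate))
```

A rejection would only suggest that the two datasets follow different distributions. Attributing that difference to bias still rests on the first assumption above, that nothing else changes the distribution.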
I hope this work will provide a tool to assess the quality of a bug-fix dataset, and to avoid the problems of bias, which are a threat to the validity of all the empirical studies using these bug-fix datasets.
Written on Apr 02 2012
Popularity bias in bug datasets
Recently, the replicability of empirical studies in Software Engineering has become a major concern in the research community. One way to achieve replicability is by reusing datasets, so that everybody bases their results on the same data. However, if these datasets contain any kind of problem, they could do more harm than good.
In the case of software defects, there are datasets that are known to contain bias, mainly when referencing a fix to a particular bug report.
We have studied a different kind of bias: popularity bias. A software project with fewer bugs is usually assumed to be of higher quality. However, in open source software development, more bugs may mean more quality. Why? Because more reported bugs imply more people looking for those bugs. That is, if you have no bugs, it may be because nobody is using your software and reporting them. If you have more bugs, it is because your software is popular; were your software less popular, the number of bugs would be lower. We have studied this effect in the case of Debian, using the Ultimate Debian Database, and we do find that only very popular Debian packages show a very high number of bugs, and that unpopular packages get very few bug reports.
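One simple way to check for this kind of relationship is a rank correlation between installation counts and bug counts. The sketch below computes Spearman's rank correlation in plain Python. It is only an illustration, not the analysis carried out in the paper: the function names are made up for this example, and the package figures are invented rather than taken from the Ultimate Debian Database.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented figures: installations and reported bugs for five packages.
installations = [120, 4500, 38000, 150000, 900]
reported_bugs = [1, 14, 60, 410, 3]

print(spearman(installations, reported_bugs))
```

A coefficient close to 1 would mean that packages with more installations tend to have more reported bugs, which is the pattern described above. It says nothing about causation.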
If you want to know more, read our WCRE 2011 paper, entitled "Impact of Installation Counts on Perceived Quality: A Case Study on Debian". A tag cloud of the contents of the paper:
age analysis binary case data days debian defects developers different distribution engineering fixed groups higher installations number open packages popularity quality recent relationship reported shows software source study system users
(created at TagCrowd.com)
To cite this paper, there is a BibTeX file available, or you can copy the entry below:
@InProceedings{debian_wcre2011,
  author = {Israel Herraiz and Emad Shihab and Thanh H.D. Nguyen and Ahmed E. Hassan},
  title = {Impact of Installation Counts on Perceived Quality: A Case Study on {D}ebian},
  booktitle = {Proceedings of the 18th Working Conference on Reverse Engineering},
  year = {2011},
  publisher = {IEEE Computer Society},
}
Written on Nov 01 2011
Older posts
- IJSODIT - Call for papers 2012 (Sep 29 2011)
- The interplay between businesses and open source (Sep 08 2011)
- Software and the game of life (Jul 29 2011)
- What's the distribution of software size? (Jul 20 2011)
- Software projects alzheimer: Julian Assange's lost contributions (Jul 07 2011)
- Practical Analyses of Software Engineering Data (Jun 15 2011)
- Empirical Software Engineering in Practice -- CFP 2011 (Jun 13 2011)
- Graffiti is not a business -- My view on the protest camps (May 25 2011)
- IJSODIT - Call for papers 2011 (Mar 29 2011)
- My impressions of Día Garum (Mar 05 2011)
- We are going to Bilbao (Feb 15 2011)
- Reflections on cyberpunk (Feb 03 2011)
- The dynamics of software evolution (Jan 24 2011)
- How did I get to the itinerary? (Jan 10 2011)
- Hello, itinerary! (Jan 04 2011)
- Debian finally shipping a free kernel (Dec 15 2010)
- Freenet, an anonymous and distributed network (Dec 11 2010)
- PyTwerp working again with Twitter (Dec 10 2010)
- "Making software" is out! (Nov 22 2010)
- Do featured articles get more visits in Wikipedia? (Nov 15 2010)
- What is the MSR challenge? (Oct 11 2010)
- IWESEP 2010 -- International Workshop on Empirical Software Engineering in Practice (Aug 23 2010)
- Learning by doing (Aug 10 2010)
- Data for Mining Software Repositories (Jun 25 2010)
- The eye of the tiger: agile methods vs. architecture (Jun 21 2010)
- Code as design. Or what's the point of Software Engineering? (Apr 06 2010)
- Hello Linkedin (Apr 02 2010)
- Special issue of the IJOSSP (Feb 23 2010)
- Where are you? (Feb 05 2010)
- New GPG key (Jan 27 2010)
- Under attack (Jan 19 2010)
- Hello world (Jan 18 2010)