A few days ago I arrived in Davis, California, for a four-month visit to UC Davis, hosted by Prof. Prem Devanbu. This visit is made possible by a "José Castillejo" grant awarded by the Spanish Ministry of Education and Science.
The main goal of this visit is to find an automated method to evaluate the bias in bug datasets. This bias is introduced when bug-fix reports are linked with commits in the version control system. When a developer accepts and/or fixes a bug report, she decides on a severity level and marks the report accordingly. In Bugzilla, one of the most widely used bug tracking systems, a developer can mark severity using a seven-level scale. In a previous paper (PDF available), I showed that not all developers use the same criteria to select the severity, and that only three levels would be enough. This difference in developers' criteria for marking and classifying bug reports is one of the sources of bias in bug-fix datasets (PDF of the paper available). Another source of bias is developer confidence: not all developers mark commits or bug reports with commit ids when they are new to a project, because they are afraid of exposing themselves. However, those commits do correspond to bug fixes, and should be accounted for in a bug-fix dataset.
This bias affects the Eclipse Bug Data from the Software Engineering Chair at Saarland University, which is one of the main data sources used in empirical software engineering. As an example, a paper studying the distribution of software bugs that was based on that Eclipse data has generated a response which found other, better distribution fits; the response did not reuse the same dataset, but gathered the data directly from the original sources.
Clearly, reusing datasets for empirical software engineering is a good idea: it fosters reproducibility and verifiability, essential properties of any empirical research discipline. However, if we cannot assure the quality of reusable datasets, they can cause more harm than good.
My goal during this visit is to apply statistical methods to evaluate the bias in a bug-fix dataset. The two papers about the distribution of bugs in Eclipse are an example of the kind of work I want to do. If we can be sure of the quality and lack of bias of a dataset, carefully built to act as a "canonical" dataset, we can compare other datasets against it to find out whether they are biased. The two papers about Eclipse mentioned above show that the distribution of bugs can vary in the presence of bias: the first paper used a biased dataset, while the second repeated the data gathering process from scratch, avoiding the biased dataset, and found different distributions for software bugs (although this could also be due to methodological differences).
So my goal is to measure this difference in distribution using a statistical technique, in order to detect the presence of bias, and to develop a statistical test that finds bias in reusable datasets. I am assuming here that the distribution cannot change due to other factors (and we already know that there are other sources of bias in bug reports), and that the shape of the distribution is unique. The second assumption is quite fair, but the first is more complicated, and it will require finding more than one dataset that is known to be unbiased.
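To give an idea of what such a comparison could look like, here is a minimal sketch using a two-sample Kolmogorov-Smirnov test to compare the distribution of per-file bug counts in a candidate dataset against a canonical one. This is only an illustration of the general approach, not the actual test I will develop: the data below is simulated, and the negative binomial shapes and parameters are assumptions made up for the example.

```python
# Hypothetical sketch: detecting distributional bias by comparing a
# candidate bug-fix dataset against a "canonical" (trusted) one with a
# two-sample Kolmogorov-Smirnov test. All data here is simulated.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Simulated per-file bug counts for the canonical dataset (assumed shape).
canonical = rng.negative_binomial(2, 0.3, size=500)

# Simulated candidate dataset whose linking process skews the counts,
# standing in for a biased bug-fix dataset.
candidate = rng.negative_binomial(2, 0.4, size=500)

# The KS statistic measures the maximum distance between the two
# empirical cumulative distribution functions.
statistic, p_value = ks_2samp(canonical, candidate)

if p_value < 0.05:
    print(f"Distributions differ (D={statistic:.3f}, p={p_value:.3g}): possible bias")
else:
    print("No significant difference detected")
```

A real test would of course need more care: bug counts are discrete and heavy-tailed, several candidate datasets would be needed to separate bias from methodological differences, and the choice of canonical dataset is itself the hard part.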
I hope this work will provide a tool to assess the quality of a bug-fix dataset and to avoid the problems of bias, which are a threat to the validity of all empirical studies that use these datasets.