The impact of bias in bug-fix datasets for defects prediction

Last week I gave a talk at UC Davis about the research work I will be doing over the coming months. The talk contains some preliminary results about the impact of bias in bug-fix datasets.

In projects with bug tracking systems and version control repositories, a commit that fixes a bug is usually marked accordingly (for instance, with a message like "Fixes bug #123"). This information can be used to recover the relation between commits and bugs, which is useful for defect prediction.

The preliminary results I have obtained so far show that the impact of bias is negligible for defect prediction if the model is based on a binary classifier (that is, one that only predicts whether an entity will contain defects, not how many). A non-biased dataset can provide better accuracy, but only because, by definition, non-biased datasets contain more data. If we reduce the size of a non-biased dataset by extracting a random sub-sample, it is as good as a biased dataset of the same size. At least, that holds for the two cases I have studied so far.
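To make the two mechanical steps above concrete, here is a minimal Python sketch. It is not the tooling used for the talk: the regular expression, the function names, and the sample commits are all hypothetical, and real projects follow many different commit message conventions. The first part recovers commit-to-bug links from messages like "Fixes bug #123"; the second draws a random sub-sample so that two datasets can be compared at the same size.

```python
import random
import re

# Hypothetical heuristic: matches "Fixes bug #123", "fixed #45", "fix bug 7".
# It will miss other conventions and can produce false positives.
BUG_REF = re.compile(r"\bfix(?:es|ed)?\s+(?:bug\s+)?#?(\d+)", re.IGNORECASE)


def linked_bug_ids(message):
    """Return the bug ids referenced by a commit message, as integers."""
    return [int(bug_id) for bug_id in BUG_REF.findall(message)]


def link_commits_to_bugs(commits):
    """Map commit id -> bug ids, keeping only commits that mention a fix."""
    links = {}
    for commit_id, message in commits:
        bug_ids = linked_bug_ids(message)
        if bug_ids:
            links[commit_id] = bug_ids
    return links


def subsample(dataset, size, seed=0):
    """Draw a random sub-sample without replacement, for same-size comparisons."""
    return random.Random(seed).sample(dataset, size)


# Made-up commits, for illustration only.
commits = [
    ("a1", "Fixes bug #123"),
    ("b2", "Refactor the parser"),
    ("c3", "fixed #45 and cleaned up tests"),
]
links = link_commits_to_bugs(commits)
```

In a comparison like the one described above, `subsample` would be applied to the larger, non-biased dataset with `size` set to the number of entries in the biased one, before training the same classifier on both.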

More details are in the slides, which you can also view at Slideshare.net or download as a PDF copy.
