In a manual examination of more than 7,000 issue reports from the bug databases of five open-source projects, we found 33.8% of all issue reports to be misclassified, that is, rather than referring to a code fix, they resulted in a new feature, an update to documentation, or an internal refactoring. This misclassification introduces bias in bug prediction models, confusing bugs and features: On average, 39% of files marked as defective actually never had a bug. We estimate the impact of this misclassification on earlier studies and recommend manual data validation for future studies.
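To make the two percentages above concrete, the following is a minimal sketch of how such rates could be computed from manually validated issue reports. It is not the procedure used in the paper: the `IssueReport` record, the category labels, the file names, and the toy data are all hypothetical, and the definition of a "falsely defective" file (a file linked to a report filed as a bug but never linked to a report verified as a bug) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IssueReport:
    """Hypothetical record; field names are illustrative, not from the paper."""
    issue_id: str
    filed_as: str     # category recorded in the bug database
    verified_as: str  # category assigned by manual inspection
    files: tuple      # files changed to resolve the report


def misclassification_rate(reports):
    """Fraction of reports whose filed category differs from the verified one."""
    if not reports:
        return 0.0
    wrong = sum(1 for r in reports if r.filed_as != r.verified_as)
    return wrong / len(reports)


def falsely_defective_rate(reports):
    """Among files linked to a report filed as 'bug', the fraction that were
    never linked to a report verified as 'bug' (assumed definition)."""
    marked = {f for r in reports if r.filed_as == "bug" for f in r.files}
    truly = {f for r in reports if r.verified_as == "bug" for f in r.files}
    if not marked:
        return 0.0
    return len(marked - truly) / len(marked)


# Toy data, invented purely for illustration.
reports = [
    IssueReport("T-1", "bug", "bug", ("a.py",)),
    IssueReport("T-2", "bug", "feature", ("b.py",)),
    IssueReport("T-3", "bug", "refactoring", ("a.py", "c.py")),
    IssueReport("T-4", "feature", "feature", ("d.py",)),
]

print(f"misclassified reports: {misclassification_rate(reports):.1%}")
print(f"falsely defective files: {falsely_defective_rate(reports):.1%}")
```

The file-level rate can exceed the report-level rate, because a single misclassified report may mark several files as defective.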
The paper describing the problem of misclassified issue reports and its impact on data mining models is currently under submission to ICSE 2013. A technical report version of the paper can be downloaded using the PDF link below.
For more information, additional results, and the data sets containing the manually classified issue reports, please visit the paper's website at http://softevo.org/bugclassify/.