Publications

Oct, 2016

ISSRE 2016

Switching to Git: the Good, the Bad, and the Ugly

Sascha Just, Kim Herzig, Jacek Czerwonka and Brendan Murphy

Since its introduction 10 years ago, GIT has taken the world of version control systems (VCS) by storm. Its success is partly due to creating opportunities for new usage patterns that empower developers to work more efficiently. However, the resulting change in both user behavior and the way GIT stores changes impacts data mining and data analytics procedures [6], [13]. While some of these unique characteristics can be managed by adjusting mining and analytical techniques, others can lead to severe data loss and the inability to audit code changes, e.g. knowing the full history of changes of code related to security and privacy functionality. Thus, switching to GIT comes with challenges to established development process analytics. This paper is based on our experience in attempting to provide continuous process analysis for Microsoft product teams that are switching to GIT as their primary VCS. We illustrate how GIT’s concepts and usage patterns create a need for changing well-established data analytic processes. The goal of this paper is to raise awareness of how certain GIT operations may damage or even destroy information about historical code changes necessary for continuous data development process analytics. To that end, we provide a list of common GIT usage patterns with a description of how these operations impact data mining applications. Finally, we provide examples of how one may counteract the effects of such destructive operations in the future. We further provide a new algorithm to detect integration paths that is specific to distributed version control systems like GIT, which allows us to reconstruct the information that is crucial to most development process analytics.

Jul, 2016

Book Chapter

Chapter in “Perspectives on Data Science for Software Engineering”

Tim Menzies, Laurie Williams, Thomas Zimmermann

About the book

Perspectives on Data Science for Software Engineering presents the best practices of seasoned data miners in software engineering. The idea for this book was created during the 2014 conference at Dagstuhl, an invitation-only gathering of leading computer scientists who meet to identify and discuss cutting-edge informatics topics. At the 2014 conference, the question of how to transfer the knowledge of seasoned software engineers and data scientists to newcomers in the field was a highlight of many discussions. While there are many books covering data mining and software engineering basics, they present only the fundamentals and lack the perspective that comes from real-world experience. This book offers unique insights into the wisdom of the community’s leaders gathered to share hard-won lessons from the trenches. Ideas are presented in digestible chapters designed to be applicable across many domains. Topics covered include data collection, data sharing, data mining, and how to utilize these techniques in successful software projects. Newcomers to software engineering data science will learn the tips and tricks of the trade, while more experienced data scientists will benefit from war stories that show what traps to avoid.

Chapter: Gotchas from mining bug reports

Sascha Just, Kim Herzig

  • [DOI] S. Just and K. Herzig, “Gotchas from mining bug reports,” in Perspectives on data science for software engineering, T. Menzies, L. Williams, and T. Zimmermann, Eds., Boston: Morgan Kaufmann, 2016, pp. 261-265.
    [Bibtex]
    @incollection{Just2016261,
    title = "Gotchas from mining bug reports ",
    editor = "Menzies, Tim and Williams, Laurie and Zimmermann, Thomas ",
    booktitle = "Perspectives on Data Science for Software Engineering ",
    publisher = "Morgan Kaufmann",
    edition = "",
    address = "Boston",
    year = "2016",
    pages = "261--265",
    isbn = "978-0-12-804206-9",
    doi = "http://dx.doi.org/10.1016/B978-0-12-804206-9.00047-7",
    url = "http://www.sciencedirect.com/science/article/pii/B9780128042069000477",
    author = "S. Just and K. Herzig",
    keywords = "Bug databases, Fixing code changes, Report categories, False bugs, Atomic changes, Version control, Tangled changes",
    abstract = "Over the years, it has become common practice in empirical software engineering to mine data from version archives and bug databases to learn where bugs have been fixed in the past, or to build prediction models to find error-prone code in the future. However, most of these approaches rely on strong assumptions that need to be verified to ensure that the resulting models are accurate and reflect the intended property; otherwise, decisions based on such flawed models can have serious consequences."
    }

Apr, 2015

EMSE 2015

Extended version of “The impact of tangled code changes on defect prediction models”

Kim Herzig, Sascha Just and Andreas Zeller

When interacting with source control management systems, developers often commit unrelated or loosely related code changes in a single transaction. When analyzing version histories, such tangled changes will make all changes to all modules appear related, possibly compromising the resulting analyses through noise and bias. In an investigation of five open-source Java projects, we found between 7% and 20% of all bug fixes to consist of multiple tangled changes. Using a multi-predictor approach to untangle changes, we show that on average at least 16.6% of all source files are incorrectly associated with bug reports. These incorrect bug file associations do not seem to significantly impact models classifying source files as having at least one bug or no bugs. But our experiments show that untangling tangled code changes can result in more accurate regression bug prediction models when compared to models trained and tested on tangled bug datasets; in our experiments, the statistically significant accuracy improvements lie between 5% and 200%. We recommend better change organization to limit the impact of tangled changes.

  • [DOI] K. Herzig, S. Just, and A. Zeller, “The impact of tangled code changes on defect prediction models,” Empirical software engineering, pp. 1-34, 2015.
    [Bibtex]
    @article{herzig-emse-2015,
    year={2015},
    issn={1382-3256},
    journal={Empirical Software Engineering},
    doi={10.1007/s10664-015-9376-6},
    title={The impact of tangled code changes on defect prediction models},
    url={http://dx.doi.org/10.1007/s10664-015-9376-6},
    link = {http://www.kim-herzig.de/2015/04/20/extended-version-of-the-impact-of-tangled-code-changes-on-defect-prediction-models-emse-journal/},
    publisher={Springer US},
    keywords={Defect prediction; Untangling; Data noise},
    author={Herzig, Kim and Just, Sascha and Zeller, Andreas},
    pages={1-34},
    language={English}
    }

Nov, 2013

ISSRE 2013

Predicting Defects Using Change Genealogies

Kim Herzig, Sascha Just, Andreas Rau and Andreas Zeller

When analyzing version histories, researchers traditionally focused on single events: e.g. the change that causes a bug, the fix that resolves an issue. Sometimes, however, there are indirect effects that count: changing a module may lead to plenty of follow-up modifications in other places, so that the initial change has an impact on those later changes. To this end, we group changes into change genealogies, graphs of changes reflecting their mutual dependencies and influences, and develop new metrics to capture the spatial and temporal influence of changes. In this paper, we show that change genealogies offer good classification models when identifying defective source files: with a median precision of 73% and a median recall of 76%, change genealogy defect prediction models not only show better classification accuracies than models based on code complexity, but can also outperform classification models based on code dependency network metrics.

  • [PDF] K. Herzig, S. Just, A. Rau, and A. Zeller, “Predicting Defects Using Change Genealogies,” in Proceedings of the 2013 ieee 24th international symposium on software reliability engineering, 2013.
    [Bibtex]
    @inproceedings{herzig-issre-2013,
    author = {Herzig, Kim and Just, Sascha and Rau, Andreas and Zeller, Andreas},
    title = {{Predicting Defects Using Change Genealogies}},
    booktitle = {Proceedings of the 2013 IEEE 24th International Symposium on Software Reliability Engineering},
    series = {ISSRE '13},
    year = {2013},
    numpages = {10},
    publisher = {IEEE Computer Society},
    link={http://www.kim-herzig.de/2013/08/07/predicting-defects-using-change-genealogies-issre-2013/},
    pdf={http://www.kim-herzig.de/wp-content/uploads/2013/11/issre2013-genealogies-CAMERA.pdf}
    }

May, 2013

ICSE 2013

It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction

Kim Herzig, Sascha Just and Andreas Zeller

In a manual examination of more than 7,000 issue reports from the bug databases of five open-source projects, we found 33.8% of all issue reports to be misclassified, that is, rather than referring to a code fix, they resulted in a new feature, an update to documentation, or an internal refactoring. This misclassification introduces bias in bug prediction models, confusing bugs and features: On average, 39% of files marked as defective actually never had a bug. We estimate the impact of this misclassification on earlier studies and recommend manual data validation for future studies.

  • K. Herzig, S. Just, and A. Zeller, “It’s not a bug, it’s a feature: how misclassification impacts bug prediction,” in Proceedings of the 2013 international conference on software engineering, Piscataway, NJ, USA, 2013, p. 392–401.
    [Bibtex]
    @inproceedings{herzig-icse-2013,
    Address = {Piscataway, NJ, USA},
    Author = {Kim Herzig and Sascha Just and Andreas Zeller},
    Booktitle = {Proceedings of the 2013 International Conference on Software Engineering},
    Date-Modified = {2013-05-28 10:37:50 +0000},
    Institution = {Universit{\"a}t des Saarlandes, Saarbr{\"u}cken, Germany},
    Keywords = {Mining software repositories, bug reports, data quality, noise, bias},
    Link = {http://mozkito.org/modules/issues/its-not-a-bug-its-a-feature-on-the-data-quality-of-bug-databases-icse_2013/},
    ISBN = {978-1-4673-3076-3},
    Location = {San Francisco, CA, USA},
    Pages = {392--401},
    acmid = {2486840},
    Publisher = {IEEE Press},
    Series = {ICSE '13},
    Title = {It's not a Bug, it's a Feature: How Misclassification Impacts Bug Prediction},
    Url = {https://users.own-hero.net/~methos/dropbox/icse13main-p180-p-16747.pdf},
    Year = {2013},
    Bdsk-Url-1 = {https://users.own-hero.net/~methos/dropbox/icse13main-p180-p-16747.pdf}}

Sep, 2010

TSE 2010

What Makes a Good Bug Report?

Thomas Zimmermann, Rahul Premraj, Nicolas Bettenburg, Sascha Just, Adrian Schröter, and Cathrin Weiss

In software development, bug reports provide crucial information to developers. However, these reports differ widely in their quality. We conducted a survey among developers and users of APACHE, ECLIPSE, and MOZILLA to find out what makes a good bug report. The analysis of the 466 responses revealed an information mismatch between what developers need and what users supply. Most developers consider steps to reproduce, stack traces, and test cases as helpful, which are, at the same time, most difficult for users to provide. Such insight is helpful for designing new bug tracking tools that guide users in collecting and providing more helpful information. Our CUEZILLA prototype is such a tool and measures the quality of new bug reports; it also recommends which elements should be added to improve the quality. We trained CUEZILLA on a sample of 289 bug reports, rated by developers as part of the survey. The participants of our survey also provided 175 comments on hurdles in reporting and resolving bugs. Based on these comments, we discuss several recommendations for better bug tracking systems, which should focus on engaging bug reporters, better tool support, and improved handling of bug duplicates.

  • T. Zimmermann, R. Premraj, N. Bettenburg, S. Just, A. Schröter, and C. Weiss, “What makes a good bug report?,” Ieee transactions on software engineering, vol. 36, iss. 5, p. 618–643, 2010.
    [Bibtex]
    @article{zimmermann-tse-2010,
    title = "What Makes a Good Bug Report?",
    author = {Thomas Zimmermann and Rahul Premraj and Nicolas Bettenburg and Sascha Just and Adrian Schröter and Cathrin Weiss},
    year = "2010",
    month = "September",
    journal = "IEEE Transactions on Software Engineering",
    number = "5",
    pages = "618--643",
    volume = "36",
    link={http://thomas-zimmermann.com/publications/details/zimmermann-tse-2010/}
    }

Sep, 2008

VL/HCC 2008

Towards the next generation of bug tracking systems

Sascha Just, Rahul Premraj and Thomas Zimmermann

Developers typically rely on the information submitted by end-users to resolve bugs. We conducted a survey on information needs and commonly faced problems with bug reporting among several hundred developers and users of the APACHE, ECLIPSE and MOZILLA projects. In this paper, we present the results of a card sort on the 175 comments sent back to us by the responders of the survey. The card sort revealed several hurdles involved in reporting and resolving bugs, which we present in a collection of recommendations for the design of new bug tracking systems. Such systems could provide contextual assistance, reminders to add information, and most important, assistance to collect and report crucial information to developers.

  • S. Just, R. Premraj, and T. Zimmermann, “Towards the next generation of bug tracking systems,” in Vl/hcc ’08: proceedings of the 2008 ieee symposium on visual languages and human-centric computing, 2008.
    [Bibtex]
    @inproceedings{just-vlhcc-2008,
    Author = {Sascha Just and Rahul Premraj and Thomas Zimmermann},
    Booktitle = {VL/HCC '08: Proceedings of the 2008 IEEE Symposium on Visual Languages and Human-Centric Computing},
    Title = {Towards the next generation of bug tracking systems},
    Year = {2008}}

Nov, 2008

ACM Distinguished Paper Award

“What Makes a Good Bug Report?” wins an ACM Distinguished Paper Award at FSE 2008 (SIGSOFT ’08/FSE-16)!

Nov, 2008

FSE 2008

What Makes a Good Bug Report?

Nicolas Bettenburg, Sascha Just, Adrian Schröter, Cathrin Weiss, Rahul Premraj and Thomas Zimmermann

In software development, bug reports provide crucial information to developers. However, these reports widely differ in their quality. We conducted a survey among developers and users of APACHE, ECLIPSE, and MOZILLA to find out what makes a good bug report.

The analysis of the 466 responses revealed an information mismatch between what developers need and what users supply. Most developers consider steps to reproduce, stack traces, and test cases as helpful, which are at the same time most difficult for users to provide. Such insight is helpful to design new bug tracking tools that guide users in collecting and providing more helpful information.

Our CUEZILLA prototype is such a tool and measures the quality of new bug reports; it also recommends which elements should be added to improve the quality. We trained CUEZILLA on a sample of 289 bug reports, rated by developers as part of the survey. In our experiments, CUEZILLA was able to predict the quality of 31–48% of bug reports accurately.

  • [DOI] N. Bettenburg, S. Just, A. Schröter, C. Weiss, R. Premraj, and T. Zimmermann, “What makes a good bug report?,” in Proceedings of the 16th acm sigsoft international symposium on foundations of software engineering, New York, NY, USA, 2008, p. 308–318.
    [Bibtex]
    @inproceedings{bettenburg-fse-2008,
    Acmid = {1453146},
    Address = {New York, NY, USA},
    Author = {Bettenburg, Nicolas and Just, Sascha and Schr\"{o}ter, Adrian and Weiss, Cathrin and Premraj, Rahul and Zimmermann, Thomas},
    Booktitle = {Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering},
    Doi = {10.1145/1453101.1453146},
    Isbn = {978-1-59593-995-1},
    Location = {Atlanta, Georgia},
    Numpages = {11},
    Pages = {308--318},
    Publisher = {ACM},
    Series = {SIGSOFT '08/FSE-16},
    Title = {What makes a good bug report?},
    Url = {http://www.st.cs.uni-saarland.de/publications/files/bettenburg-tr-2007.pdf},
    Year = {2008},
    Bdsk-Url-1 = {http://www.st.cs.uni-saarland.de/publications/files/bettenburg-tr-2007.pdf},
    Bdsk-Url-2 = {http://dx.doi.org/10.1145/1453101.1453146}}

Oct, 2007

eTX/OOPSLA 2007

Quality of Bug Reports in Eclipse

Nicolas Bettenburg, Sascha Just, Adrian Schröter, Cathrin Weiss, Rahul Premraj and Thomas Zimmermann

The information in bug reports influences the speed at which bugs are fixed. However, bug reports differ in their quality of information. We conducted a survey among the ECLIPSE developers to determine the information in reports that they widely use and the problems they frequently encounter. Our results show that steps to reproduce and stack traces are most sought after by developers, while inaccurate steps to reproduce and incomplete information pose the largest hurdles. Surprisingly, developers are indifferent to bug duplicates. Such insight is useful to design new bug tracking tools that guide reporters in providing more helpful information. We also present a prototype of a quality-meter tool that measures the quality of bug reports by scanning their content.

  • N. Bettenburg, S. Just, A. Schröter, C. Weiss, R. Premraj, and T. Zimmermann, “Quality of bug reports in eclipse,” in Proceedings of the 2007 oopsla workshop on eclipse technology exchange, New York, NY, USA, 2007.
    [Bibtex]
    @inproceedings{bettenburg-etx-2007,
    Address = {New York, NY, USA},
    Author = {Nicolas Bettenburg and Sascha Just and Adrian Schr\"{o}ter and Cathrin Weiss and Rahul Premraj and Thomas Zimmermann},
    Booktitle = {Proceedings of the 2007 OOPSLA Workshop on Eclipse Technology eXchange},
    Link = {http://mozkito.org/modules/issues/its-not-a-bug-its-a-feature-on-the-data-quality-of-bug-databases-icse_2013/},
    Location = {Montreal, Quebec, Canada},
    Month = {October},
    Publisher = {ACM Press},
    Title = {Quality of Bug Reports in Eclipse},
    Url = {http://www.st.cs.uni-saarland.de/publications/files/bettenburg-etx-2007.pdf},
    Year = {2007},
    Bdsk-Url-1 = {http://www.st.cs.uni-saarland.de/publications/files/bettenburg-etx-2007.pdf}}