Practical Considerations and Solutions in NLP-Based Analysis of Code Review Comments - An Experience Report
2025 (English). In: Product-Focused Software Process Improvement / [ed] Dietmar Pfahl, Javier Gonzalez Huerta, Jil Klünder, Hina Anwar, Springer Science+Business Media B.V., 2025, Vol. 15452, p. 342-351. Conference paper, Published paper (Refereed)
Abstract [en]
Context: Automated analysis of code review comments (CRCs) can help highlight the issues that reviewers discuss most frequently in large repositories. Topic modeling is a promising approach to analyzing large natural language repositories. However, CRCs mix natural language text with code references, so the data pre-processing and topic modeling approaches must be selected carefully.
Objective: This work discusses the decisions taken and the considerations involved in the analysis of CRCs.
Method: We used 5,560 CRCs from an open-source system to study the decisions and considerations that arise when analyzing CRCs with topic modeling. A domain expert then evaluated the interpretability of the identified themes.
Results: We report several observations and challenges in improving the quality of the identified themes, including choices regarding the pre-processing, topic modeling parameters, embedding model, and objective measures of coherence used, which impact the subjective interpretability of the identified themes.
Conclusions: This work reports considerations particular to the analysis of CRCs and the impact of the associated decisions, which can help future studies conduct topic modeling-based analyses of CRCs. Future studies can use the technical demonstrator to explore the interpretability of the topics generated from CRCs.
Place, publisher, year, edition, pages
Springer Science+Business Media B.V., 2025. Vol. 15452, p. 342-351
Series
Lecture Notes in Computer Science (LNCS), ISSN 0302-9743, E-ISSN 1611-3349 ; 15452
Keywords [en]
Natural language processing systems, Open source software, Open systems, Automated analysis, Code review, Data preprocessing, Experience report, Interpretability, Modeling approach, Natural languages, Natural languages texts, Processing model, Topic Modeling, Modeling languages
National Category
Software Engineering
Identifiers
URN: urn:nbn:se:bth-27330
DOI: 10.1007/978-3-031-78386-9_24
ISI: 001423664600024
Scopus ID: 2-s2.0-85211908780
ISBN: 9783031783852 (print)
OAI: oai:DiVA.org:bth-27330
DiVA, id: diva2:1923722
Conference
25th International Conference on Product-Focused Software Process Improvement, PROFES 2024, Tartu, Dec 2-4, 2024
Funder
ELLIIT - The Linköping‐Lund Initiative on IT and Mobile Communications
2024-12-30; 2024-12-30; 2025-09-30; Bibliographically approved