Quality Evaluation of Generative AI Systems: Processes, Metrics, Methods, and Frameworks for Industrial Software Engineering
Blekinge Institute of Technology, Faculty of Computing, Department of Software Engineering. ORCID iD: 0000-0001-5949-1375
2026 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Generative Artificial Intelligence (GenAI) is being rapidly adopted in software engineering, introducing a paradigm shift toward human-AI co-creation. However, the non-deterministic, probabilistic, and often black-box nature of GenAI models presents challenges for traditional software quality assurance. Conventional verification and validation techniques are insufficient to handle outputs that are neither predictably correct nor incorrect, but rather stochastically plausible. This discrepancy creates an urgent need for practical processes, metrics, and new governance frameworks to evaluate and manage the quality of GenAI systems in industrial environments.

This thesis examines how industrial organizations adopt GenAI, identify metrics, and evaluate system qualities in alignment with ISO quality standards. Case studies were employed to explore real-world adoption processes, identify context-specific industrial metrics, and uncover practical insights within organizations. A snowballing literature review was conducted to systematically identify, categorize, and synthesize academic metrics for evaluating the output of GenAI systems. Finally, a controlled experiment was designed to quantitatively test the efficiency (e.g., end-to-end generation time) and effectiveness (e.g., accuracy) of GenAI agent choices.

The main contributions of this thesis are a synthesized, actionable model and framework grounded in both industrial practice and quality standards. The first contribution is a four-stage adoption model, denoted the IMRM model (Innovate → considerations, Measure → metrics, Realize → values, Manage → improvements), which integrates early-stage risk assessment (e.g., legal, security, and licensing) and quality evaluation throughout GenAI adoption and usage. The second contribution is a detailed framework that connects risks and metrics to concrete decision support, justifying the business value (e.g., quality gates) and technical trade-offs of GenAI solutions. The third contribution provides a structured mapping of GenAI quality to ISO/IEC 25010, 25023, and 25059 characteristics, grounding practical evaluation needs in a standardized vocabulary.

This thesis concludes that a structured quality evaluation process that prioritizes risks and context is a valuable approach for building the business confidence required to leverage GenAI for efficient and effective software engineering in industry.

Place, publisher, year, edition, pages
Karlskrona: Blekinge Tekniska Högskola, 2026, p. 232
Series
Blekinge Institute of Technology Doctoral Dissertation Series, ISSN 1653-2090 ; 2026:01
Keywords [en]
Quality Evaluation, Metrics, Artificial Intelligence, AI, Generative AI, Empirical Software Engineering
National Category
Software Engineering
Identifiers
URN: urn:nbn:se:bth-28958
OAI: oai:DiVA.org:bth-28958
DiVA id: diva2:2018519
Public defence
2026-01-29, J1630, Karlskrona, 13:00 (English)
Opponent
Supervisors
Available from: 2025-12-08 Created: 2025-12-03 Last updated: 2025-12-18. Bibliographically approved
List of papers
1. Experience with Large Language Model Applications for Information Retrieval from Enterprise Proprietary Data
2025 (English). In: Product-Focused Software Process Improvement / [ed] Dietmar Pfahl, Javier Gonzalez Huerta, Jil Klünder, Hina Anwar, Springer, 2025, Vol. 15452, p. 92-107. Conference paper, Published paper (Refereed)
Abstract [en]

Large Language Models (LLMs) offer promising capabilities for information retrieval and processing. However, the LLM deployment for querying proprietary enterprise data poses unique challenges, particularly for companies with strict data security policies. This study shares our experience in setting up a secure LLM environment within a FinTech company and utilizing it for enterprise information retrieval while adhering to data privacy protocols. 

We conducted three workshops and 30 interviews with industrial engineers to gather data and requirements. The interviews further enriched the insights collected from the workshops. We report the steps to deploy an LLM solution in an industrial sandboxed environment and lessons learned from the experience. These lessons contain LLM configuration (e.g., chunk_size and top_k settings), local document ingestion, and evaluating LLM outputs.

Our lessons learned serve as a practical guide for practitioners seeking to use private data with LLMs to achieve better usability, improve user experiences, or explore new business opportunities. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
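The configuration parameters the abstract names (chunk_size and top_k) are typical retrieval-augmented-generation knobs: chunk_size governs how ingested documents are split, and top_k how many chunks are returned per query. The sketch below illustrates their roles with a toy in-memory retriever; the keyword-overlap scoring is a hypothetical stand-in for real vector search, and all function names are invented for illustration, not taken from the paper's deployment:

```python
from dataclasses import dataclass

@dataclass
class RetrievalConfig:
    """Hypothetical container for the two parameters named in the abstract."""
    chunk_size: int = 512  # characters per document chunk at ingestion time
    top_k: int = 3         # number of chunks returned per query

def chunk_document(text: str, cfg: RetrievalConfig) -> list[str]:
    """Split a document into fixed-size chunks for ingestion."""
    return [text[i:i + cfg.chunk_size] for i in range(0, len(text), cfg.chunk_size)]

def retrieve(chunks: list[str], query: str, cfg: RetrievalConfig) -> list[str]:
    """Rank chunks by naive keyword overlap (a stand-in for vector similarity)."""
    def score(chunk: str) -> int:
        return sum(1 for w in query.lower().split() if w in chunk.lower())
    return sorted(chunks, key=score, reverse=True)[:cfg.top_k]
```

In practice, tuning chunk_size trades retrieval precision against context coverage, while top_k trades answer completeness against prompt length, which is why the paper reports them as configuration lessons.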

Place, publisher, year, edition, pages
Springer, 2025
Series
Lecture Notes in Computer Science (LNCS), ISSN 0302-9743, E-ISSN 1611-3349 ; 15452
Keywords
AI, Artificial intelligence, Data security, Information retrieval, Large Language Model, LLM, Sandbox environment, Data privacy, Fintech, Enterprise data, Language model, Model application, Modeling environments, Privacy protocols, Security policy, Structured Query Language
National Category
Software Engineering
Identifiers
urn:nbn:se:bth-27326 (URN)
10.1007/978-3-031-78386-9_7 (DOI)
001423664600007 ()
2-s2.0-85211960724 (Scopus ID)
9783031783852 (ISBN)
Conference
25th International Conference on Product-Focused Software Process Improvement, PROFES 2024, Tartu, Dec 2-4, 2024
Funder
Knowledge Foundation, 20180010
Available from: 2024-12-28 Created: 2024-12-28 Last updated: 2025-12-03. Bibliographically approved
2. Measuring the quality of generative AI systems: Mapping metrics to quality characteristics — Snowballing literature review
2025 (English). In: Information and Software Technology, ISSN 0950-5849, E-ISSN 1873-6025, Vol. 186, article id 107802. Article, review/survey (Refereed), Published
Abstract [en]

Context: Generative Artificial Intelligence (GenAI) and the use of Large Language Models (LLMs) have revolutionized tasks that previously required significant human effort, attracting considerable interest from industry stakeholders. This growing interest has accelerated the integration of AI models into various industrial applications. However, model integration introduces challenges to product quality, as conventional quality-measurement methods may fail to assess GenAI systems. Consequently, evaluation techniques for GenAI systems need to be adapted and refined, and examining the current state and applicability of evaluation techniques for GenAI system outputs is essential.

Objective: This study aims to explore the current metrics, methods, and processes for assessing the outputs of GenAI systems and the potential for risky outputs.

Method: We performed a snowballing literature review to identify metrics, evaluation methods, and evaluation processes from 43 selected papers.

Results: We identified 28 metrics and mapped these metrics to four quality characteristics defined by the ISO/IEC 25023 standard for software systems. Additionally, we discovered three types of evaluation methods to measure the quality of system outputs and a three-step process to assess faulty system outputs. Based on these insights, we suggested a five-step framework for measuring system quality while utilizing GenAI models.

Conclusion: Our findings present a mapping that visualizes candidate metrics to be selected for measuring quality characteristics of GenAI systems, accompanied by step-by-step processes to assist practitioners in conducting quality assessments. 
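The mapping of metrics to ISO/IEC 25023 quality characteristics described above could be operationalized as a simple lookup that supports candidate-metric selection. The metric and characteristic names below are illustrative placeholders invented for this sketch, not the 28 metrics the review actually identified:

```python
# Illustrative placeholders only: the review's actual metric-to-characteristic
# mapping is given in the paper and is not reproduced here.
METRIC_TO_CHARACTERISTIC = {
    "answer_accuracy": "functional correctness",
    "hallucination_rate": "functional correctness",
    "response_latency": "time behaviour",
    "output_readability": "operability",
}

def candidate_metrics(characteristic: str) -> list[str]:
    """Return candidate metrics mapped to one quality characteristic."""
    return sorted(m for m, c in METRIC_TO_CHARACTERISTIC.items()
                  if c == characteristic)
```

A practitioner following the suggested framework would first pick the quality characteristics that matter for their system, then query such a mapping for candidate metrics to apply.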

Place, publisher, year, edition, pages
Elsevier, 2025
Keywords
Evaluation, GenAI, Generative AI, Large language model, LLM, Metric, Quality characteristics, Artificial intelligence, Computer software, ISO Standards, Mapping, Quality control, Artificial intelligence systems, Generative artificial intelligence, Language model, Quality characteristic, Reviews
National Category
Artificial Intelligence
Identifiers
urn:nbn:se:bth-28306 (URN)
10.1016/j.infsof.2025.107802 (DOI)
001519902000001 ()
2-s2.0-105008505516 (Scopus ID)
Funder
Knowledge Foundation, 20180010
Available from: 2025-07-04 Created: 2025-07-04 Last updated: 2025-12-03. Bibliographically approved
3. Evaluating the Quality of GenAI Applications in Software Engineering: A Multi-case Study
2026 (English). In: Empirical Software Engineering, ISSN 1382-3256, E-ISSN 1573-7616, Vol. 31, no. 2, article id 29. Article in journal (Refereed), Published
Abstract [en]

Context: Generative AI (GenAI) is increasingly adopted in software development for tasks such as document generation, data analysis, and code generation. However, evaluating the quality of GenAI applications becomes challenging, as traditional quality measurements may not be fully applicable.

Objective: In this study, we explore how practitioners evaluate the quality of GenAI applications and investigate quality evaluation techniques.

Method: We conducted a multi-case study in three industrial projects from software development companies. We examined four GenAI application domains: document generation, data analysis and insight generation, customer service, and code generation. Data were collected through three workshops and 23 semi-structured interviews with industrial practitioners.

Results: We identified fourteen GenAI use cases and 28 metrics currently used to evaluate the quality of GenAI applications' outputs. We synthesized the identified metrics' usage patterns and challenges based on the collected data.

Conclusions: This study presents practical insights into using metrics to measure GenAI-based system qualities in real industrial settings. Our findings indicate that practitioners use custom-built and context-specific metrics; combining these with academic metrics can strengthen GenAI system quality evaluation.

Place, publisher, year, edition, pages
Springer, 2026
Keywords
GenAI, Generative artificial intelligence, Large language model, LLM, Metric, Quality evaluation
National Category
Software Engineering; Artificial Intelligence
Identifiers
urn:nbn:se:bth-28954 (URN)
10.1007/s10664-025-10759-2 (DOI)
001632325800004 ()
2-s2.0-105024070431 (Scopus ID)
Funder
Knowledge Foundation, 20180010
Available from: 2025-12-03 Created: 2025-12-03 Last updated: 2026-01-05. Bibliographically approved
4. Paradigm shift on Coding Productivity Using GenAI
2025 (English). In: Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering (EASE 2025), Association for Computing Machinery (ACM), 2025, p. 708-713. Conference paper, Published paper (Refereed)
Abstract [en]

Generative AI (GenAI) applications are transforming software engineering by enabling automated code co-creation. However, empirical evidence on GenAI's productivity effects in industrial settings remains limited. This paper investigates the adoption of GenAI coding assistants (e.g., Codeium, Amazon Q) within telecommunications and FinTech domains. Through surveys and interviews with industrial domain experts, we identify primary productivity-influencing factors, including task complexity, coding skills, domain knowledge, and GenAI integration. Our findings indicate that GenAI tools enhance productivity in routine coding tasks (e.g., refactoring and Javadoc generation) but face challenges in complex, domain-specific activities due to limited context-awareness of codebases and insufficient support for customized design rules. We highlight new paradigms for coding transfer, emphasizing iterative prompt refinement, an immersive development environment, and automated code evaluation as essential for effective GenAI usage.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2025
Keywords
AI4SE, GenAI, Generative AI, Productivity, Software Evaluation
National Category
Software Engineering
Identifiers
urn:nbn:se:bth-28955 (URN)
10.1145/3756681.3757081 (DOI)
2-s2.0-105026936161 (Scopus ID)
Conference
29th International Conference on Evaluation and Assessment in Software Engineering, EASE 2025, Istanbul, June 17-20, 2025
Funder
Knowledge Foundation, 20180010
Available from: 2025-12-03 Created: 2025-12-03 Last updated: 2026-01-23. Bibliographically approved
5. A Framework for Evaluating GenAI Adoption and Use in Software Engineering
(English). Manuscript (preprint) (Other academic)
National Category
Software Engineering
Identifiers
urn:nbn:se:bth-28956 (URN)
Available from: 2025-12-03 Created: 2025-12-03 Last updated: 2025-12-04. Bibliographically approved
6. Evaluating the Sufficiency of Single-Agent LLM Systems for Algorithmic Problem Solving in Support and Operations
(English). Manuscript (preprint) (Other academic)
Abstract [en]

Support and operations engineers frequently develop lightweight scripts to parse logs or reconcile data, tasks that rely on algorithmic code patterns. While Multi-Agent systems are often advocated for their reliability, it remains unclear whether their coordination overhead is justified for these self-contained tasks. We analyzed 27,000+ industrial source files to validate the prevalence of small, common, but non-trivial algorithmic patterns used in practice (hash maps, string operations, sorting, and simple tree or graph traversals). We then conducted a controlled experiment on 150 LeetCode algorithmic problems that serve as representative proxies for these patterns, using four state-of-the-art LLMs (GPT-5.1, Claude-4.5, Deepseek-chat-v3.1, Gemini-2.5-pro).

Contrary to the 'more is better' assumption in AI agents, we observed a capability saturation effect: the evaluated models already solved most tasks with a single-turn prompt, leaving limited room for Multi-Agent orchestration to improve acceptance. Single-Agent baselines achieved high acceptance rates (95-99%), statistically indistinguishable from Multi-Agent systems. However, the Multi-Agent approach increased latency and token cost without yielding meaningful gains in code quality, as measured by cyclomatic complexity and lines of code.

Our results indicate that for well-defined, self-contained algorithmic scripting tasks, the four selected LLMs have crossed a sufficiency threshold: they already provide suitable single-turn solutions, so extra agent orchestration is not required. Although the single-agent solution performed on par with the multi-agent solution in the context of this study, the results provide no evidence that similar behavior should be expected for other, or more general, software engineering tasks.
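The claim that single- and multi-agent acceptance rates were statistically indistinguishable suggests a comparison of two proportions. Below is a minimal stdlib-only sketch using a pooled two-proportion z-test (normal approximation); this is an assumed illustration, not the statistical procedure the paper used, and the counts in the usage note are invented:

```python
import math

def two_proportion_p(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p-value for H0: equal acceptance rates.

    Pooled z-test with normal approximation; reasonable for large n
    and non-extreme rates.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0.0:  # degenerate case: every trial passed (or failed) in both arms
        return 1.0
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
```

For example, comparing hypothetical counts of 147/150 versus 148/150 accepted solutions gives a p-value far above 0.05, i.e., no detectable difference at that sample size.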

National Category
Software Engineering
Identifiers
urn:nbn:se:bth-28957 (URN)
Available from: 2025-12-03 Created: 2025-12-03 Last updated: 2025-12-04. Bibliographically approved

Open Access in DiVA

fulltext (15454 kB), 416 downloads
File information
File name: FULLTEXT01.pdf
File size: 15454 kB
Checksum (SHA-512): daaaca4652b4add103c89387b0d2e1ce6ea0e429ad44694f6a0790cfbb7bd241f00a368f913b87b67d87076820cb769ee65baa55543c02d905a0cbe89a0c2d74
Type: fulltext
Mimetype: application/pdf

Authority records

Yu, Liang

Search in DiVA

By author/editor
Yu, Liang
By organisation
Department of Software Engineering
Software Engineering

Search outside of DiVA

Google
Google Scholar
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.
