Evaluating Large Language Models for User Story Mining in Technology News
2025 (English) Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Abstract [en]
Background. Initiating requirements elicitation for a project remains a significant challenge in software development; most of the time, organizations rely on structured inputs or direct stakeholder interaction to derive their user stories. Unstructured sources, such as news articles, represent a potentially rich but underutilized resource for identifying emerging user needs and technological trends. Large Language Models (LLMs) offer a new possibility for automating the extraction of requirements artifacts from such articles.
Objective. This research investigates the effectiveness of using state-of-the-art LLMs to generate user stories from Information Technology (IT) news articles. It further compares the quality and characteristics of the LLM-generated user stories against those authored by human practitioners in the field of requirements engineering and development, with the aim of understanding their potential utility in downstream development activities.
Methodology. We employed a mixed-methods approach. Three prominent LLMs (Grok-3-Preview-02-24, Gemini-2.0-Pro-Exp-02-05, ChatGPT-4o-latest) were prompted using a standardized template created to extract user stories from a selected and filtered set of IT news articles. These outputs were compared with user stories written independently by human experts working with the same articles. The comparison involved automated quality assessment using the AQUAS framework and readability metrics, alongside a blind evaluation conducted by human experts who rated the stories on criteria including Decomposition Potential, Clarity and Specificity, Traceability to Source, Innovation, and Overall Development Utility.
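The automated part of such a comparison can be illustrated with a minimal sketch. It is not the AQUAS framework and not the thesis's actual tooling: it assumes the common "As a <role>, I want <goal>, so that <benefit>" story template and uses Flesch Reading Ease with a rough vowel-group syllable count as a stand-in readability metric, since the abstract names neither the template nor the specific metrics. All function names are hypothetical.

```python
import re

# Assumption: stories follow the common "As a <role>, I want <goal>,
# so that <benefit>" template; the thesis's own template is not given here.
STORY_PATTERN = re.compile(
    r"^as an? (?P<role>.+?), i want (?:to )?(?P<goal>.+?)"
    r"(?:,? so that (?P<benefit>.+?))?\.?$",
    re.IGNORECASE,
)


def parse_user_story(text):
    """Return a dict with role, goal, and benefit, or None if the template does not match."""
    match = STORY_PATTERN.match(text.strip())
    return match.groupdict() if match else None


def count_syllables(word):
    """Rough estimate: one syllable per group of consecutive vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease; higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


if __name__ == "__main__":
    story = "As a developer, I want to export logs, so that I can debug failures."
    print(parse_user_story(story))
    print(round(flesch_reading_ease(story), 2))
```

A check of this kind only covers surface form and readability; criteria such as Traceability to Source or Innovation still require the human rating described above.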
Results. Our findings indicate that LLMs can effectively generate a substantial volume of syntactically sound user stories from news articles, often exceeding the quantity produced by human experts. However, model performance varied, with Grok and ChatGPT demonstrating stronger adherence to instructions and higher syntactic quality than Gemini. Significant qualitative differences were also observed: LLM-generated stories tended to be more technically detailed, atomic, and closely tied to the source text’s implementation specifics. Human-authored stories, on the other hand, were often more strategic and contextual, and occasionally combined related needs. In the blind human expert evaluation, LLM-generated user stories were consistently rated significantly higher than the human-authored ones across all assessed dimensions, suggesting a high perceived value for subsequent development tasks.
Conclusion. We conclude that LLMs demonstrate considerable potential as tools to augment requirements elicitation by rapidly generating detailed candidate user stories from unstructured text such as IT news articles. While careful model selection and human oversight are crucial for validation, refinement, and contextualization, the structural and detailed nature of LLM outputs appears highly beneficial for downstream requirements engineering activities. LLMs and human experts offer complementary strengths, which suggests a hybrid approach may be most effective for leveraging this technology in practice.
Place, publisher, year, edition, pages
2025. p. 65
Keywords [en]
LLM, Technology News, Requirement Mining, Requirement Elicitation, User Story, Comparative Analysis
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:bth-28200
OAI: oai:DiVA.org:bth-28200
DiVA, id: diva2:1977131
Subject / course
PA2534 Master's Thesis (120 credits) in Software Engineering
Educational program
PAADA Master Qualification Plan in Software Engineering 120,0 hp
Presentation
2025-05-28, J1650, Valhallavägen 1, Karlskrona, 16:53 (English)
Supervisors
Examiners
2025-06-30 2025-06-25 2025-09-30 Bibliographically approved