Evaluating Large Language Models vs Traditional Machine Learning Models in Classifying Automotive Reports
Blekinge Institute of Technology, Faculty of Computing.
Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science.
2025 (English). Independent thesis, Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

Background. Efficient classification of mixed-format automotive reports is essential for improving scalability and operational efficiency in industries that rely heavily on diverse datasets. The complexity arises from handling structured numerical and categorical data along with unstructured text, necessitating robust and adaptable AI methods.

Objectives. This study aims to evaluate and compare traditional machine learning models with large language models for classifying automotive reports, particularly addressing the challenge of data imbalance and mixed-format data types.

Methods. A structured experimental design was utilized, incorporating traditional machine learning algorithms such as logistic regression, random forest, and LinearSVC, as well as advanced large language models with various prompting strategies, including retrieval-augmented generation. Data preprocessing involved text cleaning, translation, lemmatization, term frequency–inverse document frequency vectorization, and several balancing techniques, including synonym replacement methods (basic, embedding-based, and few-shot).
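The term frequency–inverse document frequency step named above can be illustrated with a minimal, dependency-free sketch. This is not the thesis's implementation: the whitespace tokenizer, the smoothed IDF formula, the L2 normalization, and the three sample reports are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using smoothed IDF and L2 normalization."""
    # Assumption: plain lowercase whitespace tokenization, no lemmatization.
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed IDF: idf(t) = ln((1 + n) / (1 + df(t))) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


def cosine(a, b):
    """Cosine similarity of two vectors that are already L2-normalized."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical report texts, not taken from the thesis data.
reports = [
    "brake noise at low speed",
    "brake pedal feels soft",
    "infotainment screen freezes",
]
vocab, vecs = tfidf_vectors(reports)
```

In this sketch the two brake-related reports share a term and therefore have a positive cosine similarity, while the infotainment report shares no terms with the first and scores zero against it. Vectors of this kind are what a linear classifier such as logistic regression or LinearSVC would take as input.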

Results. Results indicate that large language models, particularly with retrieval-augmented generation-based prompting and clear target descriptions, achieved significant improvements in classification accuracy over baseline traditional machine learning models. The best-performing scenario combined detailed label descriptions with recent retrieval-augmented generation examples, achieving up to 87% accuracy. Traditional machine learning models showed consistent performance but required extensive preprocessing and careful feature engineering.
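A rough sketch of how a prompt might combine label descriptions with retrieved historical examples is given below. It is an illustration under stated assumptions, not the thesis's pipeline: token-overlap (Jaccard) similarity stands in for embedding-based retrieval, ties are broken in favor of more recent examples, and the labels, descriptions, history records, and prompt wording are all hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity; a stand-in for embedding similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve(query, history, k=2):
    """Return the k most similar past reports, preferring recent ones on ties."""
    ranked = sorted(
        history,
        key=lambda ex: (jaccard(query, ex["text"]), ex["date"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, label_descriptions, examples):
    """Assemble label descriptions, retrieved examples, and the new report."""
    lines = ["Classify the report into exactly one of the labels below.", "", "Labels:"]
    for label, desc in label_descriptions.items():
        lines.append(f"- {label}: {desc}")
    lines += ["", "Similar past reports:"]
    for ex in examples:
        lines.append(f'- "{ex["text"]}" -> {ex["label"]}')
    lines += ["", f'Report: "{query}"', "Label:"]
    return "\n".join(lines)


# Hypothetical labels and history; ISO dates sort correctly as strings.
LABELS = {
    "brakes": "Faults in the braking system.",
    "infotainment": "Faults in displays, audio, or navigation.",
}
HISTORY = [
    {"text": "brake pedal feels soft", "label": "brakes", "date": "2024-11-02"},
    {"text": "screen freezes during navigation", "label": "infotainment", "date": "2025-01-15"},
    {"text": "grinding noise when braking", "label": "brakes", "date": "2025-02-20"},
]

query = "brake pedal makes grinding noise"
examples = retrieve(query, HISTORY, k=2)
prompt = build_prompt(query, LABELS, examples)
```

The resulting string would be sent to a language model of choice; the model call itself is omitted because the abstract does not name a specific model or API.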

Conclusions. Large language models demonstrate substantial potential in accurately classifying automotive reports when enhanced with contextually relevant examples through retrieval-augmented generation. However, careful management of prompt complexity and the quality of historical data used for examples is essential. Future research should further investigate embedding quality and extend evaluation across different industries.

Abstract [sv]

Background. Efficient classification of mixed-format vehicle reports is essential for improving scalability and operational efficiency in industries that depend on different types of datasets. The challenge lies in handling structured numerical and categorical data together with unstructured text, which requires robust and adaptable AI methods.

Objectives. The aim of the study is to evaluate and compare traditional machine learning models with large language models for classifying vehicle reports, with a particular focus on handling data imbalance and mixed-format data.

Methods. A structured experimental design was used with traditional machine learning models such as logistic regression, random forest, and LinearSVC, as well as advanced large language models with various prompting strategies, including RAG. Data preprocessing comprised text cleaning, translation, lemmatization, TF-IDF vectorization, and several balancing techniques, including synonym replacement methods (basic, embedding-based, and few-shot).

Results. The results show that large language models, particularly with RAG-based prompting and clear target descriptions, achieved significant improvements in classification accuracy compared with traditional machine learning models. The best-performing scenario combined detailed label descriptions with recent RAG examples and reached up to 87% accuracy. Traditional machine learning models showed stable results but required extensive preprocessing and careful feature engineering.

Conclusions. Large language models show great potential for accurate classification of vehicle reports when they are enhanced with contextually relevant examples via RAG. However, careful management of prompt complexity and of the quality of historical data is essential. Future research should further investigate the quality of embeddings and extend the evaluation to other industries.

Place, publisher, year, edition, pages
2025. p. 70
Keywords [en]
Report classification, Large Language Models, Traditional Machine Learning Models, Automotive, Data imbalance
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:bth-28054
OAI: oai:DiVA.org:bth-28054
DiVA, id: diva2:1966981
External cooperation
Volvo Cars
Subject / course
Degree Project in Master of Science in Engineering 30,0 hp
Educational program
DVAMI Master of Science in Engineering: AI and Machine Learning 300 hp
Supervisors
Examiners
Available from: 2025-06-16 Created: 2025-06-11 Last updated: 2025-09-30 Bibliographically approved

Open Access in DiVA

fulltext (1952 kB), 275 downloads
File information
File name: FULLTEXT01.pdf
File size: 1952 kB
Checksum SHA-512: 61a8c5317fc1a06bd2da4135bf8942b45f040d3cfbf606776742db6a0391fe4166d3ea97d6f5d34c3f5dde5e8fd1a96b617478ed89e669535bc1a3b088228299
Type: fulltext
Mimetype: application/pdf
