Performance Analysis of Lightweight DNN Models for Embedded Speech Recognition: The Impact of Generative AI-Augmented Data
2024 (English) Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Abstract [en]
Background: Speech recognition lets machines interpret spoken language and holds great promise for embedded systems with limited resources. Traditional models such as Hidden Markov Models (HMMs) often struggle in noisy environments, whereas modern deep neural networks (DNNs) cope better by learning complex speech patterns. Embedded devices, however, must balance accuracy against computational constraints. Data augmentation with Generative AI techniques, for example generating noisy versions of clean speech, is a key method for making models more robust. This thesis evaluates speech recognition models and aims to improve their performance in noisy, resource-constrained environments.
Objectives: The objectives are to evaluate lightweight DNN models against the resource constraints of embedded devices and to select the most efficient one. The study further aims to enhance the chosen model’s speech-to-text performance by using GAN-based data augmentation to improve accuracy and robustness in noisy environments.
Method: This research takes an experimental approach to developing a speech recognition system for an embedded speaker with a built-in microphone. It assesses lightweight Deep Neural Network (DNN) models on CPU usage, disk space, and latency. The study uses Generative AI (SimuGAN) to augment the LibriSpeech dataset with noisy data, and the best-performing model is fine-tuned on the augmented data to improve speech-to-text accuracy in noisy real-world conditions.
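The three assessment criteria named in the method (CPU usage, disk space, latency) can be sketched with the Python standard library alone. This is a minimal illustration and not the thesis's measurement setup: `profile_transcription`, `directory_size_bytes`, and the `transcribe` callable are hypothetical names, and the CPU figure here is the process's CPU time divided by wall-clock time, which may differ from how the reported CPU usage was obtained.

```python
import os
import time
from typing import Callable, Dict


def directory_size_bytes(path: str) -> int:
    """Sum the sizes of all files under `path` (a model's disk footprint)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def profile_transcription(transcribe: Callable[[str], str],
                          audio_path: str) -> Dict[str, object]:
    """Run one transcription and record latency and CPU time.

    `transcribe` is a hypothetical wrapper around whichever model is
    being assessed; it takes an audio path and returns the text.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    text = transcribe(audio_path)
    cpu_s = time.process_time() - cpu_start
    wall_s = time.perf_counter() - wall_start
    return {
        "latency_s": wall_s,    # wall-clock time for the call
        "cpu_time_s": cpu_s,    # CPU time used by this process
        # Share of one core kept busy during the call (assumed metric).
        "cpu_share": cpu_s / wall_s if wall_s > 0 else 0.0,
        "text": text,
    }
```

Running the same audio files through each candidate model with a harness like this gives comparable numbers, provided the models run in the same process type and on the same device.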
Results: The comparative analysis of the three pre-trained models (Deep Speech, Wav2Vec 2.0, and Vosk) showed that Vosk was the most efficient, with lower CPU usage (24%) and latency (4.3 s) than the other two. SimuGAN was used to augment the clean LibriSpeech dataset with noisy data, effectively simulating real-world conditions. Fine-tuning the Vosk model with this noisy augmented data reduced the Word Error Rate (WER) from 40.3% to 33.8%, demonstrating improved performance and robustness in noisy speech environments.
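For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between reference and hypothesis, divided by the number of reference words. The sketch below assumes that standard definition and simple whitespace tokenisation; the thesis's exact scoring and text normalisation are not stated in this record, and the function names are illustrative. Under this reading, the reported change from 40.3% to 33.8% is 6.5 percentage points absolute, or about 16% relative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative improvement, e.g. for WER before and after fine-tuning."""
    return (before - after) / before


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(relative_reduction(40.3, 33.8))
```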
Conclusion: Of the models evaluated, Vosk is the best lightweight DNN model for speech recognition on the selected embedded speaker. Integrating Generative AI techniques, specifically SimuGAN-based augmentation, significantly enhanced Vosk’s performance by improving its adaptability to noisy real-world conditions. These findings underscore the importance of efficient and robust models for speech recognition in embedded systems.
Place, publisher, year, edition, pages
2024. p. 40
Keywords [en]
Speech Recognition, Lightweight DNN models, Embedded systems, Generative AI, Data Augmentation, SimuGAN
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:bth-27235
OAI: oai:DiVA.org:bth-27235
DiVA, id: diva2:1921046
External cooperation
Axis Communications AB
Subject / course
DV2572 Master's Thesis in Computer Science
Educational program
DVADA Master Qualification Plan in Computer Science
Presentation
(English)
Supervisors
Examiners
2025-01-03 2024-12-13 2025-09-30 Bibliographically approved