Audio Moment Retrieval based on Natural Language Query
Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science.
2020 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
Abstract [en]

Background. Users spend a lot of time searching through media content to find a desired fragment. People can usually describe verbally what they are looking for, but today there is little use for such a description. Using it as a query to locate the right interval in a given audio sample would save a lot of time.

Objectives. The aim of this thesis is to compare the performance of methods suitable for retrieving desired intervals from audio of arbitrary length using a natural language query as input. There are two objectives. The first is to train models that match a natural language input to a specific interval of a given soundtrack. The second is to evaluate the models' performance using conventional metrics.

Methods. A mixed research method is used. Literature on existing methods suitable for audio classification was reviewed, and three models were selected for the experiments: YamNet, AlexNet and ResNet-50. Two experiments were conducted. The first measured the models' performance on classifying audio samples. The second measured the same models' performance on the audio interval retrieval problem, which uses classification as part of the approach. The steps taken to conduct the experiments are reported together with the statistical data obtained. These steps include data collection, data preprocessing, model training and performance evaluation.

Results. Two tests were conducted to see which model performs better on two separate problems: audio classification and interval retrieval based on a natural language query. Statistical data was obtained from the tests. For each selected model, we calculated the degree (performance-wise) to which a natural language query can be matched to the corresponding interval of audio of arbitrary length. The aggregated performance of the models is mostly comparable, with YamNet occasionally outperforming the other two models. The average Area Under the Curve and Accuracy for the studied models are (67, 71.62), (68.99, 67.72) and (66.59, 71.93) for YamNet, AlexNet and ResNet-50, respectively.

Conclusions. We found that the tested models were not capable of retrieving intervals from audio of arbitrary length based on a natural language query. However, the degree to which the models are able to retrieve the intervals varies with the queried keyword and with hyperparameters such as the threshold used to filter out audio patches that yield too low a probability for the queried class.
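
The role of the threshold can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `retrieve_intervals`, the parameters `hop_seconds` and `patch_seconds`, and the example probabilities are all hypothetical. It assumes a classifier has already produced one probability per audio patch for the queried class, keeps the patches at or above the threshold, and merges consecutive kept patches into time intervals.

```python
from typing import List, Sequence, Tuple


def retrieve_intervals(
    probs: Sequence[float],
    threshold: float,
    hop_seconds: float,
    patch_seconds: float,
) -> List[Tuple[float, float]]:
    """Merge consecutive patches with probability >= threshold into intervals.

    probs[i] is the (assumed) classifier probability of the queried class for
    the patch starting at i * hop_seconds and lasting patch_seconds.
    """
    intervals: List[Tuple[float, float]] = []
    start = None  # index of the first patch in the current run, if any
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run ended at patch i - 1; close the interval.
            intervals.append((start * hop_seconds,
                              (i - 1) * hop_seconds + patch_seconds))
            start = None
    if start is not None:
        # The run extends to the last patch.
        intervals.append((start * hop_seconds,
                          (len(probs) - 1) * hop_seconds + patch_seconds))
    return intervals


# Hypothetical per-patch probabilities for one queried keyword.
example_probs = [0.1, 0.8, 0.9, 0.2, 0.7]
print(retrieve_intervals(example_probs, threshold=0.5,
                         hop_seconds=1.0, patch_seconds=1.0))
# [(1.0, 3.0), (4.0, 5.0)]
```

Raising the threshold in this sketch discards more patches and yields fewer or shorter intervals, while lowering it does the opposite. This is one way to see why retrieval quality can depend on the threshold value.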

Place, publisher, year, edition, pages
2020. p. 50
Keywords [en]
Deep Learning, Intervals Retrieval, Natural Language Query, Audio Classification
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:bth-20094
OAI: oai:DiVA.org:bth-20094
DiVA, id: diva2:1450868
Subject / course
DV2572 Master's Thesis in Computer Science
Educational program
DVACS Master of Science Programme in Computer Science
Available from: 2020-08-03. Created: 2020-07-01. Last updated: 2020-08-03. Bibliographically approved

Open Access in DiVA

Audio Moment Retrieval based on Natural Language Query (736 kB), 518 downloads
File information
File name: FULLTEXT02.pdf
File size: 736 kB
Checksum (SHA-512): 4b36e02e27726fdcc512340a4671633d91efdd67c9b1143b173f8da0c9b41c569930a25277f12189550bbceec8a1f29ac0377bd48a85e4f0cad24e9cf9761106
Type: fulltext
Mimetype: application/pdf
