Cluster-based Sample Selection for Document Image Binarization
Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science.
2019 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

The current state of the art in document image binarization, in terms of performance, is to train artificial neural networks on pre-labelled ground truth data. As such, it faces the same issue as other, more conventional classification problems: it requires a large amount of training data. Unlike those conventional problems, however, document image binarization requires the binarized ground truth to be either manually crafted or estimated, which can be error-prone and time-consuming. This is where sample selection, the act of selecting training samples based on some method or metric, might help. If the training dataset can be reduced without affecting binarization performance, the time spent creating the ground truth is reduced as well. This thesis proposes a cluster-based sample selection method, based on previous work, that uses image similarity metrics and the relative neighbourhood graph to reduce the underlying redundancy of the dataset. The method is implemented with different clustering methods and similarity metrics for comparison; the best implementation is based on affinity propagation and the structural similarity index. This implementation reduces the training dataset by 46% while maintaining performance equal to that obtained with the complete dataset. The performance of this method is shown not to differ significantly from randomly selecting the same number of samples. However, because the random method has limitations, such as unpredictable performance and uncertainty about how many samples to select, sample selection in document image binarization still shows great promise.
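The relative neighbourhood graph mentioned in the abstract can be illustrated with a small sketch. It is not the thesis implementation: the abstract does not spell out the selection rule, so the rule used here (keep samples that have a graph neighbour in a different cluster) is an assumption made for illustration, as are the function names, the toy distances and the cluster labels. In the thesis setting, the distance matrix would instead come from an image similarity metric such as the structural similarity index, and the labels from a clustering method such as affinity propagation.

```python
from itertools import combinations


def relative_neighbourhood_graph(dist):
    """Return the RNG edges of a symmetric distance matrix as a set of (i, j), i < j."""
    n = len(dist)
    edges = set()
    for i, j in combinations(range(n), 2):
        # i and j are relative neighbours unless some third sample k
        # is strictly closer to both of them than they are to each other.
        blocked = any(
            max(dist[i][k], dist[j][k]) < dist[i][j]
            for k in range(n)
            if k != i and k != j
        )
        if not blocked:
            edges.add((i, j))
    return edges


def select_boundary_samples(dist, labels):
    """Hypothetical rule: keep samples with an RNG neighbour in another cluster."""
    keep = set()
    for i, j in relative_neighbourhood_graph(dist):
        if labels[i] != labels[j]:
            keep.update((i, j))
    return sorted(keep)


# Toy data: five "images" placed on a line, so distance is just |a - b|.
positions = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in positions] for a in positions]
labels = [0, 0, 0, 1, 1]  # assumed cluster assignment, not computed here

print(sorted(relative_neighbourhood_graph(dist)))  # [(0, 1), (1, 2), (2, 3), (3, 4)]
print(select_boundary_samples(dist, labels))       # [2, 3]
```

In this toy case the graph is a simple chain, and only the two samples on either side of the cluster boundary are kept, which shows how such a rule could discard redundant samples in the interior of a cluster.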

Place, publisher, year, edition, pages
2019, p. 36
Keywords [en]
document image binarization, sample selection, neural networks, computer vision
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:bth-18433
OAI: oai:DiVA.org:bth-18433
DiVA, id: diva2:1335424
Subject / course
DV2572 Master's Thesis in Computer Science
Educational program
DVACS Master of Science Programme in Computer Science
Supervisors
Examiners
Available from: 2019-07-05. Created: 2019-07-05. Last updated: 2019-07-05. Bibliographically approved.

Open Access in DiVA

fulltext (1184 kB), 12 downloads
File information
File name: FULLTEXT02.pdf
File size: 1184 kB
Checksum SHA-512:
e7f40e365f015c16ee605f318bcf7795497dd9c0782ce2ea764dd5b641906b253a54ab8cb17ffed82592b0686c48ba5fc17b3f3e31fa37d92bcc4afb3fafccf0
Type: fulltext
Mimetype: application/pdf

Author
Krantz, Amandus