Cluster-based Sample Selection for Document Image Binarization
2019 (English)Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Abstract [en]
The current state-of-the-art, in terms of performance, for solving document image binarization is training artificial neural networks on pre-labelled ground truth data. As such, it faces the same issues as other, more conventional, classification problems; requiring a large amount of training data. However, unlike those conventional classification problems, document image binarization involves having to either manually craft or estimate the binarized ground truth data, which can be error-prone and time-consuming. This is where sample selection, the act of selecting training samples based on some method or metric, might help. By reducing the size of the training dataset in such a way that the binarization performance is not impacted, the required time spent creating the ground truth is also reduced. This thesis proposes a cluster-based sample selection method, based on previous work, that uses image similarity metrics and the relative neighbourhood graph to reduce the underlying redundancy of the dataset. The method is implemented with different clustering methods and similarity metrics for comparison, with the best implementation being based on affinity propagation and the structural similarity index. This implementation manages to reduce the training dataset by 46\% while maintaining a performance that is equal to that of the complete dataset. The performance of this method is shown to not be significantly different from randomly selecting the same number of samples. However, due to limitations in the random method, such as unpredictable performance and uncertainty in how many samples to select, the use of sample selection in document image binarization still shows great promise.
Place, publisher, year, edition, pages
2019. , p. 36
Keywords [en]
document image binarization, sample selection, neural networks, computer vision
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:bth-18433OAI: oai:DiVA.org:bth-18433DiVA, id: diva2:1335424
Subject / course
DV2572 Master´s Thesis in Computer Science
Educational program
DVACS Master of Science Programme in Computer Science
Supervisors
Examiners
2019-07-052019-07-052019-07-05Bibliographically approved