CArDIS: A Swedish Historical Handwritten Character and Word Dataset for OCR
Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science.
Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science.
2022 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
Abstract [en]

Background: Digitizing handwritten characters is crucial for preserving valuable sources and cultural heritage. Optical Character Recognition (OCR) systems were introduced for this purpose and are widely used to recognize digitized characters. In the case of ancient or historical characters, automatic transcription is more challenging because of the lack of data, the high complexity, and the low quality of the source material. To address these problems, multiple image-based handwritten datasets have been collected from historical and modern document images, but these datasets also have limitations. To overcome them, we were inspired to create a new image-based historical handwritten character and word dataset and evaluate its performance using machine learning algorithms.

Objectives: The main objective of this thesis is to create the first-ever Swedish historical handwritten character and word dataset, named CArDIS (Character Arkiv Digital Sweden), which will be publicly available for further research. In addition, we verify the correctness of the dataset and perform a quantitative analysis using different machine learning methods.

Methods: We first searched for existing character datasets to understand how modern character datasets differ from historical handwritten ones, and performed a literature review to identify the datasets most commonly used for OCR. We also studied different machine learning algorithms and their applications. Finally, we trained six machine learning methods, namely Support Vector Machine, k-Nearest Neighbor, Convolutional Neural Network, Recurrent Neural Network, Random Forest, and SVM-HOG, on existing datasets and the newly created dataset to evaluate their performance and efficiency in recognizing ancient handwritten characters.

Results: The evaluation results show that the machine learning classifiers struggle to recognise the ancient handwritten characters and achieve low recognition accuracy on them. Among the six classifiers, the CNN achieves the highest recognition accuracy.

Conclusions: This thesis introduces CArDIS, the first-ever historical handwritten character and word dataset in Swedish. The character dataset contains 101,500 Latin and Swedish character images belonging to 29 classes, while the word dataset contains 10,000 word images of ten popular Swedish names belonging to 10 classes, in RGB color space. The performance of six machine learning classifiers on CArDIS and existing datasets is also reported. The thesis concludes that classifiers trained on existing datasets and tested on CArDIS show low recognition accuracy, indicating that CArDIS has unique characteristics and features compared with existing handwritten datasets. Finally, this research provides the first Swedish character and word dataset, which is robust with proven accuracy and publicly available for further research.
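The cross-dataset comparison described in the conclusions (train on one dataset, test on another) can be illustrated with a small toy sketch. Everything here is an assumption made for illustration: the two-dimensional feature vectors, the class centres, the names `existing_centers` and `historical_centers`, and the 1-nearest-neighbour classifier standing in for the thesis's six methods. It uses no CArDIS data and reproduces none of the thesis's results; it only shows how the same classifier can score lower when the test data comes from a differently distributed dataset than the training data.

```python
import random


def predict_1nn(train, query):
    """Return the label of the training sample closest to `query`
    (squared Euclidean distance)."""
    best_label, best_dist = None, float("inf")
    for features, label in train:
        dist = sum((a - b) ** 2 for a, b in zip(features, query))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def accuracy(train, test):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(predict_1nn(train, features) == label for features, label in test)
    return hits / len(test)


def sample(rng, centers, n_per_class, noise=0.1):
    """Draw toy feature vectors scattered around each class centre."""
    return [
        ([c + rng.gauss(0.0, noise) for c in center], label)
        for label, center in centers.items()
        for _ in range(n_per_class)
    ]


rng = random.Random(0)

# Hypothetical class centres: in the "historical" toy data the letter "a"
# is drawn so that it resembles the "existing" dataset's letter "b".
existing_centers = {"a": (0.0, 0.0), "b": (4.0, 0.0), "c": (0.0, 4.0)}
historical_centers = {"a": (3.5, 0.5), "b": (4.0, 0.0), "c": (0.0, 4.0)}

existing_train = sample(rng, existing_centers, 50)
historical_train = sample(rng, historical_centers, 50)
historical_test = sample(rng, historical_centers, 50)

cross_acc = accuracy(existing_train, historical_test)     # train existing, test historical
within_acc = accuracy(historical_train, historical_test)  # train and test historical

print(f"train existing, test historical:   {cross_acc:.2f}")
print(f"train historical, test historical: {within_acc:.2f}")
```

In this toy setup the cross-dataset accuracy is lower than the within-dataset accuracy because one class is distributed differently in the two datasets, which is the kind of gap the conclusions attribute to CArDIS's distinct characteristics.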

Place, publisher, year, edition, pages
2022. p. 51
Keywords [en]
Handwritten Text Recognition, Optical Character Recognition, Machine learning methods, historical handwritten character recognition, handwritten character dataset
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:bth-22840
OAI: oai:DiVA.org:bth-22840
DiVA, id: diva2:1652102
Subject / course
DV2572 Master's Thesis in Computer Science
Educational program
DVADA Master Qualification Plan in Computer Science; DVACO Master's program in computer science 120,0 hp
Supervisors
Examiners
Available from: 2022-04-20 Created: 2022-04-14 Last updated: 2025-09-30 Bibliographically approved

Open Access in DiVA

CArDIS: A Swedish Historical Handwritten Character and Word Dataset for OCR (2221 kB), 714 downloads
File information
File name: FULLTEXT02.pdf
File size: 2221 kB
Checksum (SHA-512): 2485d1afe2be3b50972f267cf56d0d3a4c08bb2f5e413fe42857b3c55aefadffa161d2f7942f155d9856933b9bc6179689c12d73058d7679b6d16187b7c77542
Type: fulltext
Mimetype: application/pdf
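
A downloaded copy of the full text can be checked against the SHA-512 checksum listed in the file information. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `sha512_of_file` and `matches_checksum` are illustrative and not part of DiVA, and the local file path is whatever location the reader saved the PDF to.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file, reading it in chunks
    so that large files do not need to fit in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Return True if the file's SHA-512 digest equals the expected hex string
    (comparison ignores case and surrounding whitespace)."""
    return sha512_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_checksum("FULLTEXT02.pdf", checksum)` would return `True` if `checksum` holds the hex string from the record and the local file (assumed here to be saved as `FULLTEXT02.pdf` in the working directory) is byte-for-byte identical to the published one.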

Total: 715 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 750 hits