CArDIS: A Swedish Historical Handwritten Character and Word Dataset
KTO Karatay Univ, TUR.
Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science. ORCID iD: 0000-0001-7536-3349
Univ Witwatersrand, ZAF.
Blekinge Institute of Technology, student.
2022 (English). In: IEEE Access, E-ISSN 2169-3536, Vol. 10, p. 55338-55349. Article in journal (Refereed), Published
Abstract [en]

This paper introduces a new publicly available image-based Swedish historical handwritten character and word dataset named Character Arkiv Digital Sweden (CArDIS) (https://cardisdataset.github.io/CARDIS/). The samples in CArDIS are collected from 64,084 Swedish historical documents written by several anonymous priests between 1800 and 1900. The character dataset contains 116,000 Swedish alphabet images in RGB color space across 29 classes, whereas the word dataset contains 30,000 image samples of ten popular Swedish names as well as 1,000 region names in Sweden. To examine the performance of different machine learning classifiers on the CArDIS dataset, three experiments are conducted. In the first experiment, Support Vector Machine (SVM), Artificial Neural Network (ANN), k-Nearest Neighbor (k-NN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Random Forest (RF) classifiers are trained on existing character datasets, namely Extended Modified National Institute of Standards and Technology (EMNIST), IAM, and CVL, and tested on the CArDIS dataset. In the second and third experiments, the same classifiers, together with two pre-trained classifiers (VGG-16 and VGG-19), are trained and tested on the CArDIS character and word datasets. The experiments show that machine learning methods trained on existing handwritten character datasets struggle to recognize characters in the CArDIS dataset, indicating that the characters in CArDIS have unique features and characteristics. Moreover, in the last two experiments, the deep learning-based classifiers provide the best recognition rates.
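
The cross-dataset protocol of the first experiment (train on one dataset, test on another) can be illustrated with a minimal sketch. It is not the authors' implementation and loads none of the real datasets: the two-dimensional toy "features", the class centres, and the helper names `make_toy_set`, `knn_predict`, and `accuracy` are all assumptions made for illustration. k-NN is used only because it is one of the classifiers named in the abstract and is short to write out.

```python
import random
from collections import Counter

# Hypothetical 2-D feature centres standing in for two character classes.
# The target set uses shifted centres to mimic a different handwriting style.
SOURCE_CENTRES = {"a": (0.0, 0.0), "b": (4.0, 0.0)}
TARGET_CENTRES = {"a": (4.0, 4.0), "b": (8.0, 0.0)}


def make_toy_set(centres, n_per_class, rng, noise=0.3):
    """Draw n_per_class noisy points around each class centre."""
    samples = []
    for label, (cx, cy) in centres.items():
        for _ in range(n_per_class):
            point = (rng.gauss(cx, noise), rng.gauss(cy, noise))
            samples.append((point, label))
    return samples


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    def squared_distance(sample):
        return sum((a - b) ** 2 for a, b in zip(sample[0], query))

    nearest = sorted(train, key=squared_distance)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(knn_predict(train, point, k) == label for point, label in test)
    return hits / len(test)


if __name__ == "__main__":
    rng = random.Random(0)
    source_train = make_toy_set(SOURCE_CENTRES, 20, rng)
    target_train = make_toy_set(TARGET_CENTRES, 20, rng)
    target_test = make_toy_set(TARGET_CENTRES, 20, rng)

    # Experiment-1 style: train on the "existing" set, test on the new one.
    print("train on source, test on target:", accuracy(source_train, target_test))
    # Experiment-2 style: train and test within the new set.
    print("train and test on target:       ", accuracy(target_train, target_test))
```

On this toy data the cross-dataset score comes out lower than the within-dataset score, which is the qualitative pattern the abstract reports; the numbers themselves carry no meaning for CArDIS.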

Place, publisher, year, edition, pages
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC, 2022. Vol. 10, p. 55338-55349
Keywords [en]
Character recognition, Optical character recognition software, Feature extraction, Hidden Markov models, Handwriting recognition, Machine learning, Image recognition, Character and word recognition, machine learning methods, optical character recognition (OCR), old handwritten style, Swedish handwritten character dataset, Swedish handwritten word dataset
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:bth-23171
DOI: 10.1109/ACCESS.2022.3175197
ISI: 000804633200001
OAI: oai:DiVA.org:bth-23171
DiVA, id: diva2:1670902
Part of project
Bigdata@BTH - Scalable resource-efficient systems for big data analytics, Knowledge Foundation
Funder
Knowledge Foundation, 20140032
Note

open access

Available from: 2022-06-16. Created: 2022-06-16. Last updated: 2022-06-16. Bibliographically approved.

Open Access in DiVA

fulltext (2364 kB), 511 downloads
File information
File name: FULLTEXT01.pdf
File size: 2364 kB
Checksum (SHA-512): 7717e1300b985b4d90e316455366a696f1d483e788d9add288ad7ba9bdf86a3bf456ea80332dbaeb704256959f8c04f9658e882f49c5d97cc6cc1d9b92f8afe9
Type: fulltext
Mimetype: application/pdf
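
A downloaded copy of the full text can be checked against the SHA-512 checksum listed in the file information. The sketch below is one way to do that with Python's standard `hashlib` module; the helper names `sha512_of_file` and `matches_record`, and the assumption that the file was saved locally as `FULLTEXT01.pdf`, are illustrative and not part of the record.

```python
import hashlib

# Checksum copied from the file information in this record.
EXPECTED_SHA512 = (
    "7717e1300b985b4d90e316455366a696f1d483e788d9add288ad7ba9bdf86a3b"
    "f456ea80332dbaeb704256959f8c04f9658e882f49c5d97cc6cc1d9b92f8afe9"
)


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_SHA512):
    """True if the file's digest equals the expected checksum (case-insensitive)."""
    return sha512_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumes the PDF was saved under its listed file name in the current directory.
    print("checksum matches:", matches_record("FULLTEXT01.pdf"))
```

Reading in chunks keeps memory use flat regardless of file size, which matters little for a 2364 kB PDF but makes the helper reusable for larger files.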

Authority records

Kusetogullari, Hüseyin
