Comparative Study of Layout Analysis of Tabulated Historical Documents
Blekinge Institute of Technology. student.
Blekinge Institute of Technology, Faculty of Computing, Department of Computer Science. ORCID iD: 0000-0002-4390-411x
ArkivDigital AB, SWE.
2021 (English). In: Big Data Research, ISSN 2214-5796, E-ISSN 2214-580X, Vol. 24, article id 100195. Article in journal (Refereed). Published
Abstract [en]

The field of multimedia retrieval systems has attracted considerable attention because such systems retrieve information more efficiently and accelerate daily tasks. Within this context, image processing techniques such as layout analysis and word recognition play an important role in transcribing the content of printed or handwritten documents into digital data that can be processed further. This transcription procedure is called document digitization. This work stems from an industrial need: a Swedish company (ArkivDigital AB) has scanned more than 80 million pages of Swedish historical documents from all over the country, and there is high demand to transcribe their contents into digital data. The process starts by locating the text, which, seen from another angle, amounts to table layout analysis. The aim of this study is to identify the most effective solution for extracting document layout from Swedish handwritten historical documents, which are characterized by their tabular form. In short, the outcomes of public tools (i.e., Breuel's OCRopus method), traditional image processing techniques (e.g., Hessian/Gabor filters, the Hough transform, histogram of oriented gradients (HOG) features), and machine learning techniques (e.g., support vector machines, transfer learning) are studied and compared. The results show that the existing OCR tool cannot carry out the layout analysis task on our Swedish historical handwritten documents. Traditional image processing techniques are mildly capable of extracting the general table layout in these documents, but accuracy is enhanced by introducing machine learning techniques. The best-performing approach will be used in our future document mining research to allow for the development of scalable resource-efficient systems for big data analytics. © 2021 Elsevier Inc.
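
The abstract lists the Hough transform among the traditional techniques compared for table layout extraction. As a rough illustration only, and not the authors' implementation, the sketch below shows the idea in its simplest form: when only horizontal and vertical lines are considered, Hough voting reduces to counting ink pixels per row and per column. The function names, the `min_fill` threshold, and the toy image are all hypothetical.

```python
def find_ruling_lines(image, min_fill=0.8):
    """Return (rows, cols) that look like table ruling lines.

    image: list of equal-length rows, 1 = ink, 0 = background.
    Each ink pixel votes for its own row and column (an axis-aligned
    Hough accumulator). A row or column counts as a ruling line when
    at least min_fill of its pixels are ink. The threshold is an
    illustrative assumption, not a value from the paper.
    """
    height, width = len(image), len(image[0])
    row_votes = [sum(row) for row in image]
    col_votes = [sum(image[y][x] for y in range(height)) for x in range(width)]
    rows = [y for y, votes in enumerate(row_votes) if votes >= min_fill * width]
    cols = [x for x, votes in enumerate(col_votes) if votes >= min_fill * height]
    return rows, cols


def make_toy_table(height, width, rows, cols):
    """Build a synthetic binary image with full-length ruling lines."""
    image = [[0] * width for _ in range(height)]
    for y in rows:
        for x in range(width):
            image[y][x] = 1
    for x in cols:
        for y in range(height):
            image[y][x] = 1
    return image


# Toy 7x9 "page": three horizontal and three vertical rulings,
# plus one stray ink pixel standing in for handwriting.
toy = make_toy_table(7, 9, rows=[0, 3, 6], cols=[0, 4, 8])
toy[1][2] = 1
detected_rows, detected_cols = find_ruling_lines(toy)
```

On real scans, rulings are skewed, broken, and crossed by handwriting, so a full Hough transform over many angles, or the learned approaches the abstract mentions, would be needed in place of this axis-aligned shortcut.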

Place, publisher, year, edition, pages
Elsevier Inc. , 2021. Vol. 24, article id 100195
Keywords [en]
Feature extraction, Historical handwritten documents, Image processing, Layout analysis, Machine learning
National Category
Computer Sciences; Computer Vision and Robotics (Autonomous Systems)
Identifiers
URN: urn:nbn:se:bth-20991
DOI: 10.1016/j.bdr.2021.100195
ISI: 000642459200009
OAI: oai:DiVA.org:bth-20991
DiVA, id: diva2:1524833
Projects
Scalable resource-efficient systems for big data analytics
Funder
Knowledge Foundation, 20140032
Available from: 2021-02-02 Created: 2021-02-02 Last updated: 2021-08-25 Bibliographically approved

Authority records

Cheddad, Abbas