Comparative study of table layout analysis: Layout analysis solutions study for Swedish historical hand-written document
2019 (English) Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Abstract [en]
Background. Information retrieval systems are becoming increasingly popular: they help people retrieve information more efficiently and speed up daily tasks. In this context, image processing technology plays an important role by helping to transcribe the content of printed or handwritten documents into digital data for information retrieval systems. This transcription procedure is called document digitization. It employs image processing techniques such as layout analysis and word recognition to segment the document content and transcribe the image content into words. A Swedish company, ArkivDigital® AB, needs to transcribe its document data into digital data.
Objectives. The aim of this study is to find an effective solution for extracting the document layout of Swedish handwritten historical documents, which are characterized by tabular forms containing handwritten content. To this end, the outcomes of applying OCRopus, OCRfeeder, traditional image processing techniques, and machine learning techniques to Swedish historical handwritten documents are compared and studied.
Methods. Implementation and experiment are used to develop three comparative solutions in this study. The first is Hessian filtering with a mask operation; the second is Gabor filtering with a morphological open operation; the third is Gabor filtering with machine learning classification. For the third solution, different alternatives were explored to build the document layout extraction pipeline. First, the Hessian filter and the Gabor filter are evaluated. Second, the images are filtered with the filter that performed better in the previous stage, and the filtered image is then refined with the Hough line transform. Third, transfer learning features and custom features are extracted. Fourth, the previously extracted features are fed to a classifier and the result is analyzed. After all the solutions are implemented, they are applied to a sample set of the Swedish historical handwritten documents and their performance is compared by means of a survey.
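To make the morphological open operation in the second solution more concrete, the following is a minimal sketch and not the thesis implementation. It assumes the page has already been filtered and binarized (the Gabor filtering step is omitted), represents the image as a list of rows of 0/1 values, and uses a straight line of k pixels as the structuring element. All function names and the value of k are hypothetical and chosen only for illustration.

```python
def erode_h(img, k):
    """Keep a pixel only if it and the next k-1 pixels to its right are all set."""
    w = len(img[0])
    return [[1 if x + k <= w and all(row[x:x + k]) else 0 for x in range(w)]
            for row in img]


def dilate_h(img, k):
    """Set a pixel if it or any of the k-1 pixels to its left is set."""
    w = len(img[0])
    return [[1 if any(row[max(0, x - k + 1):x + 1]) else 0 for x in range(w)]
            for row in img]


def open_h(img, k):
    """Opening (erosion, then dilation) with a horizontal line of k pixels.

    Horizontal runs shorter than k pixels are removed; longer runs survive.
    """
    return dilate_h(erode_h(img, k), k)


def transpose(img):
    return [list(col) for col in zip(*img)]


def open_v(img, k):
    """Opening with a vertical line of k pixels, done on the transposed image."""
    return transpose(open_h(transpose(img), k))


def extract_table_lines(img, k):
    """Union of the horizontal and vertical openings: long straight strokes only."""
    horiz = open_h(img, k)
    vert = open_v(img, k)
    return [[a | b for a, b in zip(r1, r2)] for r1, r2 in zip(horiz, vert)]


# Toy page: one horizontal rule (row 1), one vertical rule (column 3),
# and a small blob in the lower right standing in for handwriting.
page = [
    [0, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 0, 0],
]
lines = extract_table_lines(page, 5)
for row in lines:
    print("".join("#" if v else "." for v in row))
```

In this toy example the two rules survive because they are at least k pixels long, while the short blob is removed. On real scans the choice of k would trade off between suppressing handwriting strokes and keeping short or broken table lines.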
Results. Both open source OCR systems, OCRopus and OCRfeeder, fail to deliver a usable outcome because they are designed to handle general document layouts rather than table layouts. The traditional image processing solutions work in more than half of the cases, but they do not work well. Combining traditional image processing techniques with machine learning techniques gives the best result, but at a great time cost.
Conclusions. The results show that existing OCR systems cannot carry out the layout analysis task on our Swedish historical handwritten documents. Traditional image processing techniques are capable of extracting the general table layout in these documents. Introducing machine learning techniques allows a better and more accurate table layout to be extracted, but at a higher time cost.
Place, publisher, year, edition, pages
2019. p. 64
Keywords [en]
layout analysis, pattern recognition, document digitalization, table layout extraction, transfer learning, machine learning, image processing, Hessian filter, Gabor filter
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:bth-17643; OAI: oai:DiVA.org:bth-17643; DiVA, id: diva2:1292198
External cooperation
ArkivDigital
Subject / course
DV2572 Master's Thesis in Computer Science
Educational program
DVACS Master of Science Programme in Computer Science
Projects
Scalable resource-efficient systems for big data analytics
2019-02-28, 2019-02-27, 2019-02-28. Bibliographically approved