Quantifying the noise tolerance of the OCR engine Tesseract using a simulated environment
Blekinge Institute of Technology, Faculty of Computing, Department of Creative Technologies.
2014 (English). Student thesis
Abstract [en]

Context. Optical Character Recognition (OCR), having a computer recognize text from an image, is not as intuitive as human recognition. Degradations that look small to human eyes can still thwart the OCR result. The problem is that random, unknown degradations are unavoidable in a real-world setting.

Objectives. The noise tolerance of Tesseract, a state-of-the-art OCR engine, is evaluated in terms of how well it handles salt and pepper noise, a type of image degradation. Noise tolerance is measured as the percentage of aberrant pixels when comparing two images (one with noise and one without).

Methods. A novel systematic approach for finding the noise tolerance of an OCR engine is presented. A simulated environment is developed in which the test parameters (font, font size, text string), called test cases, can be modified. The simulation program creates a text string image (white background, black text), degrades it iteratively using salt and pepper noise, and lets Tesseract perform OCR on it in each iteration. The iteration stops when the OCR result from Tesseract no longer matches the text string in the image.

Results. Simulation results are given as the percentage of changed pixels (the noise tolerance) between the clean text string image and the image from the last degradation iteration before Tesseract failed to recognize all characters in the text string. The results include 14400 test cases: 4 fonts (Arial, Calibri, Courier and Georgia), 100 font sizes (1-100) and 36 different strings (4*100*36=14400), resulting in about 1.8 million OCR attempts performed by Tesseract.

Conclusions. The noise tolerance depended on the test parameters. Font sizes smaller than 7 were not recognized at all, even without noise applied. The font size interval 13-22 was the peak performance interval, i.e. the interval with the highest noise tolerance, except for the only monospaced font tested, Courier, which had lower noise tolerance in that interval. For the font size interval 22-100, the noise tolerance decreased as the font size increased. The noise tolerance of Tesseract as a whole, given the experiment results, was circa 6.21 %, i.e. if 6.21 % of the pixels in the image have changed, Tesseract can still recognize all text in the image.
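The degradation loop described under Methods and Results can be sketched as follows. This is a minimal illustration, not the thesis implementation: images are plain 2D lists of grayscale values, `recognize` is a hypothetical stand-in for the call to Tesseract, and the number of pixels degraded per iteration (`pixels_per_step`) and the iteration cap (`max_steps`) are assumptions, since the abstract does not state them.

```python
import random

WHITE, BLACK = 255, 0


def add_salt_and_pepper(image, n_pixels, rng):
    """Return a copy of `image` with `n_pixels` random pixels set to white or black."""
    noisy = [row[:] for row in image]
    height, width = len(noisy), len(noisy[0])
    for _ in range(n_pixels):
        y, x = rng.randrange(height), rng.randrange(width)
        noisy[y][x] = rng.choice((WHITE, BLACK))
    return noisy


def changed_pixel_percentage(clean, noisy):
    """Percentage of pixels that differ between two images of equal size."""
    total = sum(len(row) for row in clean)
    changed = sum(
        c != n
        for clean_row, noisy_row in zip(clean, noisy)
        for c, n in zip(clean_row, noisy_row)
    )
    return 100.0 * changed / total


def find_noise_tolerance(clean, expected_text, recognize,
                         pixels_per_step=1, seed=0, max_steps=100_000):
    """Degrade `clean` step by step until `recognize` no longer returns `expected_text`.

    Returns the changed-pixel percentage of the last image that was still
    recognized correctly, or None if even the clean image is not recognized.
    `recognize` is a hypothetical callable (image -> str) standing in for the OCR engine.
    """
    if recognize(clean) != expected_text:
        return None
    rng = random.Random(seed)
    last_ok = clean
    for _ in range(max_steps):  # assumed safety cap so the loop always ends
        candidate = add_salt_and_pepper(last_ok, pixels_per_step, rng)
        if recognize(candidate) != expected_text:
            break
        last_ok = candidate
    return changed_pixel_percentage(clean, last_ok)
```

In a real run, `recognize` would render nothing itself; it would pass the degraded image to the OCR engine and return the recognized string. Returning `None` for an unrecognized clean image mirrors the reported case of font sizes below 7, which failed even without noise.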

Place, publisher, year, edition, pages
2014. p. 35
Keywords [en]
Optical Character Recognition, salt and pepper noise, Tesseract
National Category
Software Engineering
Identifiers
URN: urn:nbn:se:bth-4028
Local ID: oai:bth.se:arkivexD948B0401BE4E11BC1257D0600683D71
OAI: oai:DiVA.org:bth-4028
DiVA, id: diva2:831347
Educational program
PAACI Master of Science in Game and Software Engineering
Uppsok
Technology
Supervisors
Note

42

Available from: 2015-04-22 Created: 2014-06-29 Last updated: 2018-01-11 Bibliographically approved

Open Access in DiVA

fulltext (612 kB), 2631 downloads
File information
File name: FULLTEXT01.pdf
File size: 612 kB
Checksum (SHA-512): 9db9975b2c64c00378dac6c2c6a692ba96a0a7b4fc0488deb97adcbffee72818f97731ee8a5d0814cf21e8d0bf4f57732338a63006914cd4559e4f316420854f
Type: fulltext
Mimetype: application/pdf

By organisation
Department of Creative Technologies
Software Engineering

Total: 2634 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.
Total: 1041 hits