ST-KeyS: Self-supervised Transformer for Keyword Spotting in historical handwritten documents
Digital Research Center of Sfax, Tunisia.
Digital Research Center of Sfax, Tunisia.
Universitat Autònoma de Barcelona, Spain.
Digital Research Center of Sfax, Tunisia.
2026 (English). In: Pattern Recognition, ISSN 0031-3203, E-ISSN 1873-5142, Vol. 170, article id 112036. Article in journal (Refereed). Published.
Abstract [en]

Keyword spotting (KWS) in historical documents is an important tool for the initial exploration of digitized collections. The most effective KWS methods today rely on machine learning techniques, which typically require a large amount of annotated training data. For historical manuscripts, however, annotated corpora for training are scarce. To handle this data-scarcity issue, we investigate the merits of self-supervised learning: extracting useful representations of the input data without relying on human annotations, then using these representations in the downstream task. We propose ST-KeyS, a masked autoencoder model based on vision transformers, whose pretraining stage follows the mask-and-predict paradigm and needs no labeled data. In the fine-tuning stage, the pre-trained encoder is integrated into a Siamese neural network model to improve the feature embedding of the input images. We further improve the image representation using pyramidal histogram of characters (PHOC) embedding, creating and exploiting an intermediate representation of images based on text attributes. In an exhaustive experimental evaluation on five widely used benchmark datasets (Botany, Alvermann Konzilsprotokolle, George Washington, Esposalles, and RIMES), the proposed approach outperforms state-of-the-art methods trained on the same data.
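The PHOC embedding used in the fine-tuning stage has a standard construction (the pyramidal histogram of characters of Almazán et al.): at pyramid level L the word is split into L horizontal regions, and each region gets a binary indicator vector over the alphabet marking which characters fall in it. The sketch below is a minimal, generic implementation of that idea; the particular levels, alphabet, and overlap rule chosen here are illustrative assumptions, not the exact configuration used in ST-KeyS.

```python
import string

def phoc(word, alphabet=string.ascii_lowercase, levels=(1, 2, 3)):
    """Binary PHOC vector for a non-empty `word`.

    At pyramid level L the word is split into L equal regions of the
    normalized interval [0, 1]. Character i of an n-character word
    occupies the span [i/n, (i+1)/n]; it is assigned to a region when at
    least half of that span overlaps the region (a common assignment
    rule; assumed here, not taken from the paper).
    """
    word = word.lower()
    n = len(word)  # assumed > 0
    vec = []
    for L in levels:
        for r in range(L):
            r_start, r_end = r / L, (r + 1) / L
            region = [0] * len(alphabet)
            for i, ch in enumerate(word):
                if ch not in alphabet:
                    continue  # characters outside the alphabet are ignored
                c_start, c_end = i / n, (i + 1) / n
                overlap = min(c_end, r_end) - max(c_start, r_start)
                if overlap / (c_end - c_start) >= 0.5:
                    region[alphabet.index(ch)] = 1
            vec.extend(region)
    return vec
```

With levels (1, 2, 3) and a 26-letter alphabet the vector has (1 + 2 + 3) x 26 = 156 dimensions; because the construction depends only on the transcription, the same embedding can be predicted from a word image and compared against embeddings of query strings.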

Place, publisher, year, edition, pages
Elsevier, 2026. Vol. 170, article id 112036
Keywords [en]
Keyword spotting, Masked autoencoders, PHOC embedding, Self-supervised learning, Siamese neural networks, Visual transformers, Character recognition, History, Image representation, Labeled data, Learning algorithms, Learning systems, Neural networks, Supervised learning, Autoencoders, Embeddings, Handwritten documents, Historical documents, Signal encoding
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:bth-28471
DOI: 10.1016/j.patcog.2025.112036
ISI: 001528801600002
Scopus ID: 2-s2.0-105009722690
OAI: oai:DiVA.org:bth-28471
DiVA, id: diva2:1988257
Part of project
DocPRESERV – Preserving & Processing Historical Document Images with Artificial Intelligence, The Swedish Foundation for International Cooperation in Research and Higher Education (STINT)
Funder
The Swedish Foundation for International Cooperation in Research and Higher Education (STINT), AF2020-8892
Available from: 2025-08-11. Created: 2025-08-11. Last updated: 2025-09-30.
Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Cheddad, Abbas

By organisation
Department of Computer Science
