ST-KeyS: Self-supervised Transformer for Keyword Spotting in historical handwritten documents
2026 (English). In: Pattern Recognition, ISSN 0031-3203, E-ISSN 1873-5142, Vol. 170, article id 112036. Article in journal (Refereed). Published.
Abstract [en]
Keyword spotting (KWS) in historical documents is an important tool for the initial exploration of digitized collections. The most efficient KWS methods today rely on machine learning techniques, which typically require large amounts of annotated training data. For historical manuscripts, however, annotated corpora for training are scarce. To address this data scarcity, we investigate the merits of self-supervised learning: extracting useful representations of the input data without relying on human annotations and then using these representations in the downstream task. We propose ST-KeyS, a masked autoencoder model based on vision transformers whose pretraining stage follows the mask-and-predict paradigm and requires no labeled data. In the fine-tuning stage, the pre-trained encoder is integrated into a Siamese neural network and fine-tuned to improve the feature embedding of the input images. We further improve the image representation with pyramidal histogram of characters (PHOC) embedding, creating and exploiting an intermediate representation of images based on text attributes. In an exhaustive experimental evaluation on five widely used benchmark datasets (Botany, Alvermann Konzilsprotokolle, George Washington, Esposalles, and RIMES), the proposed approach outperforms state-of-the-art methods trained on the same datasets.
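To illustrate the PHOC text attribute mentioned in the abstract, the sketch below builds a binary pyramidal histogram of characters for a transcription string. The alphabet, the pyramid levels, and the 50%-overlap assignment rule are assumptions in the spirit of common PHOC formulations, not details taken from this paper.

```python
def phoc(word, alphabet="abcdefghijklmnopqrstuvwxyz", levels=(1, 2, 3)):
    """Binary PHOC vector: one bit per (level, region, character).

    At pyramid level l, the word is split into l equal horizontal regions.
    A character is assigned to a region if at least half of its normalized
    horizontal extent overlaps that region (a common PHOC convention).
    """
    word = word.lower()
    n = len(word)
    vec = []
    for level in levels:
        for region in range(level):
            r_start, r_end = region / level, (region + 1) / level
            bits = [0] * len(alphabet)
            for i, ch in enumerate(word):
                if ch not in alphabet:
                    continue  # ignore characters outside the alphabet
                c_start, c_end = i / n, (i + 1) / n
                overlap = min(r_end, c_end) - max(r_start, c_start)
                if overlap >= (c_end - c_start) / 2:
                    bits[alphabet.index(ch)] = 1
            vec.extend(bits)
    return vec

# For "ab" with levels (1, 2): level 1 sees both characters; at level 2,
# 'a' falls in the left half and 'b' in the right half.
v = phoc("ab", levels=(1, 2))
```

Word images and query strings mapped into this shared attribute space can then be compared directly, e.g. with a cosine distance between PHOC vectors.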
Place, publisher, year, edition, pages
Elsevier, 2026. Vol. 170, article id 112036
Keywords [en]
Keyword spotting, Masked autoencoders, PHOC embedding, Self-supervised learning, Siamese neural networks, Visual transformers, Character recognition, History, Image representation, Labeled data, Learning algorithms, Learning systems, Neural networks, Supervised learning, Embeddings, Handwritten documents, Historical documents, Pyramidal histogram of characters embedding, Signal encoding
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:bth-28471
DOI: 10.1016/j.patcog.2025.112036
ISI: 001528801600002
Scopus ID: 2-s2.0-105009722690
OAI: oai:DiVA.org:bth-28471
DiVA id: diva2:1988257
Part of project
DocPRESERV – Preserving & Processing Historical Document Images with Artificial Intelligence, The Swedish Foundation for International Cooperation in Research and Higher Education (STINT)
Funder
The Swedish Foundation for International Cooperation in Research and Higher Education (STINT), AF2020-8892
Available from: 2025-08-11. Created: 2025-08-11. Last updated: 2025-09-30. Bibliographically approved.