Chhun, Sophia
Department of Information and Communication Engineering, Institute of Technology of Cambodia, Cambodia
Author
Abstract
Analysis and recognition of historical documents face many challenges, one of which is the scarcity of the ground truth data needed for most machine learning techniques, deep learning in particular. In this paper, we present a novel approach that significantly augments the word image samples generated from an existing dataset of Khmer ancient palm leaf manuscripts. Instead of segmenting real Khmer words, we combine the annotated glyphs into groups called sub-syllables. A new text recognition method is also proposed to take into account the spatially complex structure of Khmer writing. The proposed method is composed of two main modules: a feature generator and a decoder. The generator uses convolutional blocks, inception blocks, and a bidirectional LSTM to encode information extracted from the input image so that the attention-based decoder can decode it to predict the final text transcription. The experiments are conducted on a new dataset of sub-syllables constructed from annotated glyphs of the SleukRith Set.
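As a rough illustration of the attention-based decoding step mentioned in the abstract, the sketch below computes attention weights over a sequence of encoder feature vectors and the resulting context vector for one decoding step. The abstract does not specify the attention variant, so the dot-product scoring, the function name attention_step, and the plain-list data representation are all assumptions made for illustration, not the authors' implementation.

```python
import math


def attention_step(query, encoder_states):
    """One step of dot-product attention (illustrative assumption only).

    query: current decoder state, a list of d floats.
    encoder_states: list of T encoder feature vectors, each of d floats.
    Returns (weights, context): T attention weights summing to 1, and the
    weighted average of the encoder states (a list of d floats).
    """
    # Score each encoder state by its dot product with the decoder state.
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]

    # Softmax over the scores, shifted by the maximum for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the encoder states.
    dim = len(query)
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context
```

In a full recognizer, the context vector would typically be combined with the decoder state to predict the next output symbol, and the step would be repeated until an end-of-sequence symbol is produced.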
Valy, D., Verleysen, M., & Chhun, S. (2020). Data Augmentation and Text Recognition on Khmer Historical Manuscripts. 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), Dortmund (Germany). https://hdl.handle.net/2078.5/254147