Chhun, Sophea
Department of Information and Communication Engineering, Institute of Technology of Cambodia, Cambodia
Author
Burie, Jean-Christophe
Laboratoire Informatique Image Interaction (L3i), University of La Rochelle, France
Author
Abstract
Analysis of ancient Khmer documents can be quite challenging due to the elaborate shapes of Khmer handwritten characters combined with the complex way in which words are formed from those characters. Palm leaf manuscripts, among the best-known old Khmer documents, are being digitized and centralized; document analysis functions such as text search are therefore necessary, but they remain unavailable for this type of document. To contribute to research in this area, we introduce in this paper a new dataset called SleukRith Set, comprising 657 pages of Khmer palm leaf manuscripts randomly selected from various collections whose quality and digitization methods vary. The dataset contains three types of data: isolated characters, words, and lines. Each type is annotated with ground truth information, which can be used both to evaluate and to train methods for common document analysis tasks such as character/text recognition, word/line segmentation, and word spotting. To serve as a baseline, we also present the results of an evaluation study of Khmer isolated character recognition that we conducted on SleukRith Set using a Convolutional Neural Network.
Valy, D., Verleysen, M., Chhun, S., & Burie, J.-C. (2017). A New Khmer Palm Leaf Manuscript Dataset for Document Analysis and Recognition - SleukRith Set. 4th International Workshop on Historical Document Imaging and Processing (HIP) at ICDAR2017, Kyoto (Japan). https://hdl.handle.net/2078.5/254125