A New Khmer Palm Leaf Manuscript Dataset for Document Analysis and Recognition - SleukRith Set

Valy, Dona; Verleysen, Michel; Chhun, Sophea; Burie, Jean-Christophe

(2017) 4th International Workshop on Historical Document Imaging and Processing (HIP) at ICDAR2017 — Location: Kyoto (Japan) (8.November.2017)

Details

Authors

Valy, Dona (UCLouvain)
Verleysen, Michel (UCLouvain)
Chhun, Sophea (Department of Information and Communication Engineering, Institute of Technology of Cambodia, Cambodia)
Burie, Jean-Christophe (Laboratoire Informatique Image Interaction (L3i), University of La Rochelle, France)

Abstract

Analysis of ancient Khmer documents can be quite challenging due to the elaborate shapes of Khmer handwritten characters combined with the complex way in which words are formed from those characters. Palm leaf manuscripts, among the best-known old Khmer documents, are being digitized and centralized; document analysis functions such as text search are therefore necessary but remain unavailable for this type of document. To contribute to progress in the relevant research, we introduce in this paper a new dataset called SleukRith Set, comprising 657 pages of Khmer palm leaf manuscripts randomly selected from various collections whose quality and digitization methods vary. The dataset contains three types of data: isolated characters, words, and lines. Each type of data is annotated with ground truth information, which is useful both for evaluation and as a training set for common document analysis tasks such as character/text recognition, word/line segmentation, and word spotting. To serve as a baseline, we also present the results of an evaluation study of Khmer isolated character recognition that we conducted on SleukRith Set using a convolutional neural network.

Affiliations

UCLouvain, SST/ICTM/ELEN - Pôle en ingénierie électrique

Citations

Valy, D., Verleysen, M., Chhun, S., & Burie, J.-C. (2017). A New Khmer Palm Leaf Manuscript Dataset for Document Analysis and Recognition - SleukRith Set. 4th International Workshop on Historical Document Imaging and Processing (HIP) at ICDAR2017, Kyoto (Japan). https://hdl.handle.net/2078.5/254125