Wednesday, November 29, 2023

Research Article: Toward a Low-Resource Non-Latin-Complete Baseline: An Exploration of Khmer Optical Character Recognition

An IEEE research article on the topic: "Toward a Low-Resource Non-Latin-Complete Baseline: An Exploration of Khmer Optical Character Recognition"

by Rina Buoy (Graduate Student Member, IEEE), Masakazu Iwamura (Member, IEEE), Sovila Srun, and Koichi Kise

ABSTRACT 

Many existing text recognition methods rely on the structure of Latin characters and words. Such methods may not be able to deal with non-Latin scripts that have highly complex features, such as character stacking, diacritics, ligatures, non-uniform character widths, and writing without explicit word boundaries. In addition, from a natural language processing (NLP) perspective, most non-Latin languages are considered low-resource due to the scarcity of large-scale data. This paper presents a convolutional Transformer-based text recognition method for low-resource non-Latin scripts, which uses local two dimensional (2D) feature maps. The proposed method can handle images of arbitrarily long text lines, which may occur with non-Latin writing without explicit word boundaries, without resizing them to a fixed size by using an improved image chunking and merging strategy. It has a low time complexity in self-attention layers and allows efficient training. The Khmer script is used as the representative of non-Latin scripts because it shares many features with other non-Latin scripts, which makes the construction of an optical character recognition (OCR) method for Khmer as hard as that for other non-Latin scripts. Thus, by analogy with the AI-complete concept, a Khmer OCR method can be considered as one of the non-Latin-complete methods and can be used as a low-resource non-Latin baseline method. The proposed 2D method was trained on synthetic datasets and outperformed the baseline models on both synthetic and real datasets. Fine-tuning experiments using Khmer handwritten palm leaf manuscripts and other non-Latin scripts demonstrated the feasibility of transfer learning from the Khmer OCR method. To contribute to the low-resource language community, the training and evaluation datasets will be made publicly available.
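
To make the chunking idea in the abstract more concrete, here is a minimal sketch of splitting a long text-line image into fixed-width, overlapping horizontal spans. This is not the paper's "improved image chunking and merging strategy": the function name `chunk_spans`, the parameters `chunk_width` and `overlap`, and the fixed-stride scheme are all assumptions made for illustration, and the merging step is not covered.

```python
def chunk_spans(width, chunk_width, overlap):
    """Return (start, end) pixel columns covering an image of the given width.

    Hypothetical illustration only: consecutive chunks share `overlap`
    columns, and the last chunk is cut short at the image edge.
    """
    if width <= 0 or chunk_width <= 0:
        raise ValueError("width and chunk_width must be positive")
    if not 0 <= overlap < chunk_width:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_width")

    stride = chunk_width - overlap  # how far each new chunk advances
    spans = []
    start = 0
    while True:
        end = min(start + chunk_width, width)
        spans.append((start, end))
        if end >= width:  # the whole line is covered
            break
        start += stride
    return spans


if __name__ == "__main__":
    # A 250-pixel-wide line, 100-pixel chunks, 20 pixels of overlap.
    print(chunk_spans(250, 100, 20))
```

The overlap gives a recognizer some context on both sides of each cut, which matters for scripts without explicit word boundaries, where a cut can land in the middle of a character cluster.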




Wednesday, April 19, 2023

Master's Thesis on OCR Using Deep Learning - Pavel Andrlik, 2022

Thanks to Pavel Andrlik for another citation, in his master's thesis on OCR using deep learning.

University of West Bohemia
Faculty of Applied Sciences
Department of Cybernetics
(Czech Republic)


The author highlighted my paper when discussing the choice of classification method, namely the use of a support vector machine...


Abstract of the Thesis

This diploma thesis deals with the problem of optical character recognition (OCR) using neural networks. I am focusing on improving text detection and OCR by fine-tuning an E2E-MLT scene text detector by training it on synthetic data which emulates real data. The model was fine-tuned on several datasets with synthetically generated data and real data, then the models were tested on one synthetic and two real datasets, one with the majority of the wild text, the second with the majority of TV news imprinted text. On the dataset with the majority of TV news imprinted texts, the fine-tuned models achieved improvement by decreasing the character error rate from 52% to 31.6% and the word error rate from 56.5% to 22%. It was also experimentally discovered that training models on synthetic data simulating real TV news images deteriorates the detection and reading capability of the model on wild text data.
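
For readers unfamiliar with the two metrics quoted in the abstract, here is a minimal sketch of how character error rate (CER) and word error rate (WER) are commonly computed: the edit distance between the recognized text and the reference, divided by the reference length. The function names are my own, and the thesis may normalize or tokenize differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Insertions, deletions and substitutions each cost 1.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance over characters, divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def word_error_rate(reference, hypothesis):
    """Edit distance over whitespace-separated words, divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    print(character_error_rate("abcd", "abxd"))          # one substitution in four characters
    print(word_error_rate("the cat sat", "the cat sit"))  # one wrong word in three
```

Note that whitespace-based WER is a poor fit for Khmer, which is written without explicit word boundaries; a word segmenter would be needed first.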

----------

What interests me is the motivation side!

My quick reflection on the motivation:

The use case could also apply to written papers gathered for data collection, such as artists' ideas, random articles, etc. We have a lot of handwriting and printed writing that should also be considered part of the collection for our language.


Sunday, October 17, 2021

A compact deep learning model for Khmer handwritten text recognition

Bayram Annanurov, Norliza Mohd Noor 

Department of Computer Science, Paragon International University, Cambodia 

Department of Engineering, Razak Faculty of Technology and Informatics, Universiti Teknologi Malaysia, Malaysia

Abstract (of the Paper)

The motivation of this study is to develop a compact offline recognition model for Khmer handwritten text that would be successfully applied under limited access to high-performance computational hardware. Such a task aims to ease the ad-hoc digitization of vast handwritten archives in many spheres. Data collected for previous experiments were used in this work. The one-against-all classification was completed with state-of-the-art techniques. A compact deep learning model (2+1CNN), with two convolutional layers and one fully connected layer, was proposed. The recognition rate came out to be within 93-98%. The compact model is performed on par with the state-of-the-art models. It was discovered that computational capacity requirements usually associated with deep learning can be alleviated, therefore allowing applications under limited computational power.

Link To the Page 






Friday, February 19, 2021

Optical character recognition system for Baybayin scripts using support vector machine

A new publication related to the SVM method for OCR: "Optical character recognition system for Baybayin scripts using support vector machine" - https://peerj.com/articles/cs-360/


Thanks for the citation, which makes it clearer that the method can work in other cases too.



This part delighted me and brought that work back to mind.




Abstract (of the paper)

 In 2018, the Philippine Congress signed House Bill 1022 declaring the Baybayin script as the Philippines’ national writing system. In this regard, it is highly probable that the Baybayin and Latin scripts would appear in a single document. In this work, we propose a system that discriminates the characters of both scripts. The proposed system considers the normalization of an individual character to identify if it belongs to Baybayin or Latin script and further classify them as to what unit they represent. This gives us four classification problems, namely: (1) Baybayin and Latin script recognition, (2) Baybayin character classification, (3) Latin character classification, and (4) Baybayin diacritical marks classification. To the best of our knowledge, this is the first study that makes use of Support Vector Machine (SVM) for Baybayin script recognition. This work also provides a new dataset for Baybayin, its diacritics, and Latin characters. Classification problems (1) and (4) use binary SVM while (2) and (3) apply the multiclass SVM classification. On average, our numerical experiments yield satisfactory results: (1) has 98.5% accuracy, 98.5% precision, 98.49% recall, and 98.5% F1 Score; (2) has 96.51% accuracy, 95.62% precision, 95.61% recall, and 95.62% F1 Score; (3) has 95.8% accuracy, 95.85% precision, 95.8% recall, and 95.83% F1 Score; and (4) has 100% accuracy, 100% precision, 100% recall, and 100% F1 Score.
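
As a reminder of what the four reported numbers mean, here is a minimal sketch that computes accuracy, precision, recall and F1 score for a binary problem such as (1), Baybayin versus Latin. The function name, the labels "B" and "L", and the toy predictions are made up for illustration; they are not the paper's data, and the paper's averaging scheme is not stated in the abstract.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 for one positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = len(pairs) - tp - fp - fn                               # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels: "B" for Baybayin, "L" for Latin (hypothetical data).
    y_true = ["B", "B", "B", "L", "L", "L"]
    y_pred = ["B", "B", "L", "L", "L", "B"]
    print(binary_metrics(y_true, y_pred, positive="B"))
```

For the multiclass problems (2) and (3), the same per-class numbers are usually averaged over all classes.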

Wednesday, July 22, 2020

i2ocr - Free Online Khmer OCR, It works!

i2OCR.com provides free Khmer OCR for everyone to use. I have tested it and got a good enough result: I would say around 95% comes out OK when the text is in Khmer Unicode.

I will try it sometime on the old Limon or ABC fonts. It does not work for handwritten text.

So you may try when you need: http://www.i2ocr.com/free-online-khmer-ocr

The tool is now added to my list, Khmer Tools.



Sample testing


Sunday, November 24, 2019

Application of Support Vector Machine in Prediction Secondary Structure Protein

Application of Support Vector Machine in Prediction Secondary Structure Protein
NI Jabbar, RI Jabbar

This paper studies the prediction of secondary structure protein from primary
structure protein using support vector machine (SVM). We classify 64 types of
proteins in three types: Helices (H), Strand (E) and Coil (C).


View Here.



Thanks for the citation.

Monday, July 1, 2019

Writing Lunar Date in Khmer: Khmer Date Converter

As you may have noticed, the lunar date in Khmer is written on all official documents used in Cambodia's government.

And of course, it is also the hardest one for people who are not in a government office, as we mostly do not use the lunar calendar in private life. When we do need it, we always check the lunar calendar issued by the corresponding organization.

Here is a help tool: https://tools.wikischool.asia/KhmerLunar

The tool helps you write the correct date in the Khmer language.


If you find this useful and want to stay engaged with WikiSchool, reach them on Facebook.

Updated (2021/02):
  • The tool is unpublished for the time being. I will let you know when it is back.