KhmerOCR: KhmerOCR


Wednesday, April 19, 2023

Master Thesis of OCR using deep learning - Pavel Andrlik, 2022

Thanks to Pavel Andrlik for another citation, in his master's thesis on OCR using deep learning.

University of West Bohemia
Faculty of Applied Sciences
Department of Cybernetics
(Czech Republic)

The author highlighted my paper regarding the choice of classification method, on using the Support Vector Machine...

Abstract of the Thesis

This diploma thesis deals with the problem of optical character recognition (OCR) using neural networks. I am focusing on improving text detection and OCR by fine-tuning an E2E-MLT scene text detector by training it on synthetic data which emulates real data. The model was fine-tuned on several datasets with synthetically generated data and real data, then the models were tested on one synthetic and two real datasets, one with the majority of the wild text, the second with the majority of TV news imprinted text. On the dataset with majority of TV news imprinted texts the fine-tuned models achieved improvement by decreasing character error rate from 52% to 31.6% and word error rate from 56.5% to 22%. It was also experimentally discovered that training models on synthetic data simulating real TV news images deteriorate detection and reading model capability on wild text data.

----------

What interests me is the motivation side!

The full thesis can be found at: https://otik.uk.zcu.cz/bitstream/11025/48953/1/Thesis___Pavel_Andrlik.pdf

My quick reflection on the motivation side!

The use case could also apply to written material for data collection, such as artists' ideas, random articles, etc. We have a lot of handwriting and printed writing that should also be considered part of the collection for our language.

Tuesday, August 23, 2016

Paper: Experimental Comparison of the Performance of SVMs

The research paper on:

Experimental Comparison of the Performance of SVMs with Different Kernel Functions for Recognizing Arabic Characters

Said Ghoniemy, Sayed Fadel, M. Asif

Abstract

A considerable progress in the recognition of Latin and Chinese characters has been achieved. By contrast, Arabic Optical Character Recognition is still lagging. This is because Arabic language is a cursive language, written from right to left, and each character has different forms according to its position in the word. Support vector machines using kernel classifiers represent a typical approach for character recognition. Choosing the most appropriate kernel highly depends on the problem at hand – and fine tuning its parameters can easily become a tedious and cumbersome task. The present study is devoted to an experimental comparison of the performance of SVM machines with different kernel functions for recognizing Arabic Characters. Two groups of kernel functions were used throughout the study, each group contains 7 kernel functions. The obtained results show that, in the radial basis group, Laplacian kernel gives the best results. In the special functions group, the T-Student approach gives the best results. However, combining both kernels did not yield better performance.

Read detail

[..]

Sok, P. and Taing, N., "Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set Recognition", Asia-Pacific Signal and Information Processing Association, 2014 Annual Summit and Conference (APSIPA) (pp. 1-9). IEEE. December 2014.

[..]

Thanks for citing my research on an SVM-related method.

Wednesday, December 30, 2015

Trying Out NIPTICT OCR and KhmerOCR.org

For this test I used the tool I made with iText for rendering Khmer in PDF, described in my post at http://ask.osify.com/qa/613, which uses the Khmer OS Battambang font to generate the PDF.

I tested on two URLs:

1. http://KhmerOCR.org by providing the image file
2. http://rnd.niptict.edu.kh/ocr/ by uploading the PDF file

Here is the result so far

Tuesday, May 26, 2015

KhmerOCR Demo App Released on GitHub

First of all, as I have already stated on my GitHub, do not expect this release to be a full OCR system; it is only a first demo of my research using Support Vector Machine from 2013, slightly updated in 2014. Thanks for understanding.

Since I cannot commit my time to continue on this topic, I prefer to publish the demo, and I will soon make the source code public as well.

Currently people are working on TesseractOCR and we are waiting for results. Some results can already be found with the OCR team at khmerocr.org; please try it out and support this team if you can.

If you are still interested in seeing mine, please download it from GitHub: KhmerOCR.NET-App

Tuesday, October 14, 2014

First Online KhmerOCR Based on the TesseractOCR Engine

KhmerType just announced an online KhmerOCR implemented with the TesseractOCR engine.

This is the year of the Tesseract OCR engine, since researchers everywhere in Cambodia are focusing on it. TesseractOCR is an open-source OCR engine maintained by Google. Some people have tried to train the engine on Khmer characters since 2009, but the results were not good enough to use.

Today Danh Hong, the team leader of his OCR project and well known as the Khmer OS fonts designer, together with Thim Rithy, the moonOS (Unix Kernel OS) founder, announced his result as an online tool, khmerocr.org, which lets people try converting scanned documents into Khmer Unicode text.

Web Base Interface: KhmerOCR.org

He is currently asking people to test it and report errors to improve the system.

I ran some tests with the test files I used in my previous research; the results show it still needs a lot of time and work to improve.

All my test cases below use the font Khmer OS Content.

Case 1: Real Scanned Document (not much noise), font size: 32pt

The document is written in Khmer OS Content at 32pt and was scanned on an HP Scanjet G3110 at high resolution, so it is clear enough. The result is not yet good enough.

Case 1 : Scanned Document

Updated 15/10: Danh Hong helped train on the expected document and the result is good (see the comments).

Case 2: Printed Text Using MS Paint (no noise), font size: 11pt

I tried this more realistic document since its size is close to what people actually use.

Even though it has no noise, the result is not good enough yet.

Case 2: Printed Text with Small font size

Case 3: Printed Text Using MS Paint (no noise), font size: 48pt

At this font size, my method could produce around 98% accuracy on printed text. Here the result is also good.

These tests cover only one font face, and they are also results for the developers to improve on.
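For anyone repeating tests like these, here is a minimal sketch of one common way to put a number on a result: character accuracy computed as 1 minus the character error rate, where the error rate is the edit distance between the expected text and the OCR output divided by the length of the expected text. This is an assumption about scoring for illustration only, not necessarily how the figures in this post were measured, and the function names are hypothetical. It counts Unicode code points, so a Khmer consonant with its subscripts and vowel signs counts as several characters.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - CER, clamped at 0, where CER = edit distance / reference length."""
    if not reference:
        # With no expected text, only an empty output counts as correct.
        return 1.0 if not hypothesis else 0.0
    cer = edit_distance(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)
```

For example, character_accuracy("abcd", "abxd") gives 0.75, since one of the four expected characters was misread.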

I believe TesseractOCR will do better as more training data is made, but it will not produce the 100% accuracy people expect. We need more involvement to bring it to at least 95% accuracy so that people can use it. That's why the OCR conference was formed.

Thanks to Danh Hong and his team for this public initiative.

We are waiting for other people's results as well.

Monday, October 13, 2014

Welcome to the Khmer OCR Conference, 28th October

The event is now publicly announced: the Khmer OCR conference, hosted at the conference hall of the Ministry of Posts and Telecommunication.

OCR (Optical Character Recognition) software has long been used around the world to convert printed text in an image, PDF or scanned paper into machine-readable text.

Khmer OCR has been in the research phase for a long time with several researchers, but in most cases each researcher is trying to solve a different issue in OCR technology, such as segmentation (line or character separation), recognition or classification.

Now the conference will focus on producing software that works for public use; it invites all related researchers to discuss different solutions and a TODO list for the next steps.

The event is by invitation only; please contact the host at research@niptict.edu.kh if you would like to participate.

The event is happening only after several discussions and meetings with the team so far.

Sunday, September 28, 2014

Stay Tuned, the OCR Conference Is Delayed Again

According to today's update, the venue informed us that the conference, which was first delayed to 1st of October, has been postponed again to the end of October because of the invitations and the wish to have all potential researchers on board.

The invitations are promised by next week; wait and see.

Saturday, August 2, 2014

We are in need of the product, OCR Project for Khmer Language

As I stated in my previous article, "The State of The Art", KhmerOCR is still in the research stage and there is no ready product yet.

More recently, some groups are taking up this challenge and are willing to introduce a product by forming a concrete team for it. Some individual teams are also doing the same thing.

The joint team of some universities and individual researchers held a meeting recently, on 31st July.
It is too early to detail how it will go, but it's great to see more people happy and willing to contribute to the project for our Cambodia.

And another surprise: I just saw another project being presented and asking for funding, OCR Khmer. It seems to be an online tool; let's watch their promotional video:

According to the video, the online OCR project appears to run on images of printed text at a font size of 36pt.

The project is asking for $4,000 in funding on gofundme.com.
It's great to see a product on the way; let's help him. You can click here for more info.

Tuesday, July 22, 2014

Excited to See Things on the Way [Blog]

I'm glad to see things are on their way, and glad to meet more people involved in the current issue.
I do hope this solidarity will bring a better result.

Let's dream! Let's plan!

Wednesday, April 23, 2014

State of The Art Of KhmerOCR Implementation

There aren't many articles on the Internet about KhmerOCR; I myself haven't found many either.

Of course, some people or companies might be quietly implementing a solution, but without any announcement I believe my assumptions below are relevant enough for people to understand the current situation of Khmer OCR.

Let's share around "State of The Art Of KhmerOCR" today ;)

I found that several Khmer OCR research works have been published through organizations, websites and universities.

Methodologies

When we talk about Khmer OCR, we mean solutions that convert characters in scanned images of handwritten, typewritten or printed text into machine-encoded text.

OCR solutions mostly focus on:

Pre-processing (usually noise removal)
Segmentation
    Line segmentation
    Character segmentation
Recognition
Mapping (Character Assembling)
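
As a rough illustration of the line segmentation step listed above, here is a minimal sketch using a horizontal projection profile: count the ink pixels in each row and treat every run of non-empty rows as one text line. This is a textbook technique picked for illustration, not the method of any of the works mentioned in this post. It assumes the page is already binarised (1 = ink, 0 = background) and de-skewed, and the names segment_lines and min_ink are hypothetical.

```python
from typing import List, Tuple


def segment_lines(image: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges of text lines; bottom is exclusive.

    image is a binary raster given as rows of pixels: 1 = ink, 0 = background.
    A row belongs to a text line when it holds at least min_ink ink pixels.
    """
    lines = []
    start = None  # top row of the line currently being scanned, if any
    for y, row in enumerate(image):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y  # a new line begins here
        elif not has_ink and start is not None:
            lines.append((start, y))  # a blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(image)))  # line runs to the bottom edge
    return lines


# A tiny 5-row "page" with two text lines separated by one blank row.
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
line_ranges = segment_lines(page)
```

The same idea applied to the columns of a single line gives a first cut at character segmentation, but Khmer subscripts and vowel signs stacked above and below the base consonant mean a real system needs more than a plain projection.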

Some methods have already been used for Khmer OCR in the segmentation or recognition part, such as:

Legendre Moment Descriptor,
Wavelet Descriptor,
Hidden Markov Model (HMM),
Back propagation,
Scale Invariant Feature Transform (SIFT),
Fourier Descriptor, Hole detection,
Template Matching,
etc.
And (it seems) the most recent one is: Support Vector Machine (SVM)

Literature Review/History

I might have missed some, but here is what I could find about what has been done so far on this topic.

If you know of more, please share through the comment form. I will check and update.

1. The Khmer Printed Characters Recognition using Legendre Moment Descriptor by Chey Chanoeurn et al. achieved 92% accuracy on 10 Khmer consonants: ប ព ជ ក ភ ណ ឃ ស វ and ឆ
2. 2005, The Khmer Printed Character Recognition Using Wavelet Descriptors by Chey Chanoeurn et al. achieved accuracies of 92.85%, 91.66% and 89.27% on 10 types of Khmer fonts in 3 different sizes.
3. 2008, The Khmer Segmentation for font Limon S1, size 22, by Ing Leng Ieng, PAN Localization Project, achieved an accuracy of 99.11%.
4. 2009, The Khmer OCR for Limon R1, size 22, by Ing Leng Ieng from the PAN Localization Project, using framing and Discrete Cosine Transform calculation for recognition based on a Hidden Markov Model, achieved an accuracy of 98.88%.
5. 2011, The Khmer Optical Character Recognition (OCR) by Mr. Kruy Vanna, using Fourier Descriptors, Component’s Holes, and Component’s Location, achieved an accuracy of 97.9% on 19 types of Khmer font.
6. 2012, The Khmer Printed Character Recognition combining Edge Detection and Template Matching, by Iech Setha et al., for one font, “Khmer OS Content”, at a font size of 36pt, achieved an accuracy of 99%.
7. 2013, The Khmer Printed Character Recognition based on Support Vector Machine (SVM), by Pongsametrey SOK, for one font, “Khmer OS Content”, at a font size of 36pt, achieved an accuracy of 98.54% (32pt = 98.62%, 28pt = 98.18%) with a training set at font size 32pt.

Research No. 7 is my own, submitted to the Royal University of Phnom Penh (RUPP), so it is not publicly published anywhere yet.

Who Is Doing It Nowadays

This covers only those I know of in Cambodia; people studying abroad might be working on it too. Anyway, here is what I know:

Institute of Technology of Cambodia (ITC): it seems some KhmerOCR implementation work is continuing there.
Royal University of Phnom Penh (RUPP) is also doing more research on this through students' research, theses and their lecturers.
Open Institute (Open Forum, KhmerOS.info): I believe this NGO is still interested in the topic.
And there are some other individuals as well, as I heard (?)

What's Interesting

One open-source OCR engine, Tesseract OCR, is a complete engine, from image processing through recognition to output.

To make it work for our Khmer language, we need to analyze how to train it with our dataset.

I also trained Tesseract on some Khmer letters. The system seems good to go, but there are some things we need to be aware of first, as in the question I posted here.
I will try to write a post on how I trained some characters before.

Few Training Char, All Are Error

Why Tesseract at this time?
Previous research mostly used its own combination of methods to solve various issues for the Khmer language, such as segmentation or recognition, but pre-processing (image processing) is also important for a real OCR system and its accuracy.
And I can see that Tesseract OCR is ready for all of that.

Has Anyone Already Tried Tesseract?
Yes. From my searches on Google and elsewhere, people have been trying since 2009.
Some universities or lecturers might already have done it, but that remains unclear to me.

So, Is There Any Ready Tesseract OCR for Khmer?
My presumed answer: no, I have not heard of a ready training set for Khmer to use with the Tesseract OCR engine.

But just today I checked the Tesseract repo again (14 January 2014) and saw that some Khmer config has been added (files: Khmer.unicharset, Khmer.xheights); we need to test whether they work.
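
For readers who want to try such a test themselves, here is a hedged sketch that calls the Tesseract command-line tool from Python. It assumes a trained Khmer language file is installed under the code khm (the name used in the current tessdata repository; this post does not confirm that one was ready at the time), and the helper names and file names are hypothetical. The helper returns None when the tesseract program is not on the PATH.

```python
import shutil
import subprocess
from typing import List, Optional


def tesseract_command(image_path: str, output_base: str, lang: str = "khm") -> List[str]:
    """Build the CLI call: tesseract <image> <output base> -l <language>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path: str, output_base: str, lang: str = "khm") -> Optional[str]:
    """Run Tesseract and return the recognised text, or None if it is not installed."""
    if shutil.which("tesseract") is None:
        return None
    # check=True raises CalledProcessError if Tesseract fails, for example
    # when the requested language data is missing.
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract writes its plain-text result to <output base>.txt
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


# Hypothetical file names, for illustration only.
command = tesseract_command("scan.png", "result")
```

Comparing the returned text against the known source text of a test page is then a simple way to see whether the Khmer data is working.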

Therefore students, lecturers, NGOs and the community should take part to help with this.

Conclusions

OCR systems are of great interest to people nowadays.

We have been using Khmer Unicode since it was established in the Kingdom in 2003, and thanks to Unicode we recently got Google Translate. Khmer OCR should be solved somehow as well.
