Tuesday, October 14, 2014

First Online KhmerOCR Based on the TesseractOCR Engine

KhmerType has just announced an online KhmerOCR implemented with the TesseractOCR engine.

This is the year of the Tesseract OCR engine in Cambodia: researchers everywhere are focusing on it. TesseractOCR is an open-source OCR engine maintained by Google. Some people have tried to train the engine on Khmer characters since 2009, but the results were not good enough to use.

Today Danh Hong, team leader of his OCR project and well known as the Khmer OS fonts designer, together with Thim Rithy, founder of moonOS (Unix Kernel OS), announced the result as an online tool, khmerocr.org, which lets people try converting a scanned document into Khmer Unicode text.

Web-Based Interface: KhmerOCR.org

He is currently asking people to test the system and report errors to help improve it.

I ran some tests with the test files I used in my previous research. The results show that a lot of time and work are still needed to improve the system.

All of my test cases below use the font Khmer OS Content.

Case 1: Real Scanned Document (not much noise), font size: 32pt

The document is written in Khmer OS Content at 32pt and was scanned on an HP Scanjet G3110 at high resolution, so the image is clear enough. The result is not yet good enough.

Case 1: Scanned Document
Updated 15/10: Danh Hong helped train the engine on the expected document, and the result is good. (See the comments.)

Case 2: Printed Text Using MS Paint (no noise), font size: 11pt

I consider this a serious test case because the font size is close to what people actually use.
Even though the image has no noise, the result is not yet good enough.

Case 2: Printed Text with Small font size

Case 3: Printed Text Using MS Paint (no noise), font size: 48pt

At this font size, my own method could produce around 98% accuracy on printed text. The online tool also produces a good result here.



These tests cover only one font face, and they are also a result that developers can use to improve the system.
I believe TesseractOCR will do better as more training data is produced, but it will not reach the 100% accuracy that people expect. We need more involvement to bring this to at least 95% accuracy together so that people can use it. That is why the OCR conference was formed.
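The accuracy figures above can be made concrete with a small sketch. It assumes accuracy is measured per character as one minus the edit distance between the expected text and the OCR output, divided by the length of the expected text. That is a common convention, not necessarily the one used in my earlier tests or by the KhmerOCR team, and the function names here are only illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(expected: str, recognized: str) -> float:
    """Assumed metric: 1 - (edit distance / length of the expected text)."""
    if not expected:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(expected, recognized)
    # Clamp at 0 because a very long wrong output can exceed the expected length.
    return max(0.0, 1.0 - errors / len(expected))
```

Because Python strings are sequences of code points, a Khmer subscript sign or vowel counts as its own character here, so one missing vowel in a five-code-point word already drops the score to 80%.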

Thanks to Danh Hong and his team for this public initiative.
We are waiting for other people's results as well.

Monday, October 13, 2014

Welcome to the Khmer OCR Conference, 28th October

The event has now been announced to the public: the Khmer OCR conference will be hosted at the conference hall of the Ministry of Posts and Telecommunication.

OCR (Optical Character Recognition) software is used around the world to convert printed text in an image, PDF, or scanned paper into computer text or characters.

Khmer OCR has been in the research phase for a long time, with several researchers already working on it, but in most cases each researcher is trying to solve a different issue in OCR technology, such as segmentation (line or character separation), recognition, or classification.
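To illustrate what line separation means, here is a minimal sketch of one textbook approach, the horizontal projection profile: count the ink pixels in each row and treat runs of non-empty rows as text lines. It is not the method of any particular researcher mentioned here, and the function name, the 0/1 image representation, and the `min_ink` threshold are assumptions made for the example.

```python
from typing import List, Tuple


def segment_lines(image: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Split a binarized page into text lines.

    image: rows of pixels, where 1 means ink and 0 means background.
    Returns (start_row, end_row) pairs with end_row exclusive.
    """
    # Horizontal projection profile: amount of ink in each pixel row.
    profile = [sum(row) for row in image]

    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # a blank row ends the line
            start = None
    if start is not None:                  # the page ends inside a line
        lines.append((start, len(profile)))
    return lines
```

For example, `segment_lines([[0, 0], [1, 1], [0, 1], [0, 0], [1, 0]])` returns `[(1, 3), (4, 5)]`. Real Khmer pages are harder than this: subscripts and vowels placed above or below the base letters can touch neighbouring lines, and skew or noise breaks the clean gaps the sketch relies on.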

The conference will now focus on producing software that works for public use. It invites all related researchers to discuss different solutions and a TODO list for the next steps.

The event is by invitation only; please contact the host at research@niptict.edu.kh if you would like to participate.

The event comes after a number of discussions and meetings with the team so far.