Tuesday, October 14, 2014

First Online KhmerOCR Based on the TesseractOCR Engine

KhmerType has just announced an online KhmerOCR tool implemented with the TesseractOCR engine.

This year is the year of the Tesseract OCR engine: researchers everywhere in Cambodia are focusing on it. TesseractOCR is an open-source OCR engine maintained by Google. People have been trying to train the engine on Khmer characters since 2009, but the results were not good enough to use.

Today Danh Hong, the team leader of the OCR project and well known as the designer of the Khmer OS fonts, together with Thim Rithy, founder of moonOS (Unix Kernel OS), announced his result as an online tool, khmerocr.org, which lets people try converting a scanned document into Khmer Unicode text.

Web-Based Interface: KhmerOCR.org

He is currently asking people to test the system and report errors so that it can be improved.

I ran some tests with the test files I used in my previous research. The results show that a lot of time and work is still needed to improve the system.

All of my test cases below use the font Khmer OS Content.

Case 1: Real Scanned Document (not much noise), font size: 32pt

The document is written in Khmer OS Content at 32pt and was scanned on an HP Scanjet G3110 at high resolution, so the image is clear enough. Even so, the result is not yet good enough.

Case 1: Scanned Document
Updated 15/10: Danh Hong helped by training the engine on this document, and the result is now good. (See the comments.)

Case 2: Printed Text Using MS Paint (no noise), font size: 11pt

I consider this a serious test case, since this font size is close to what people actually use.
Even though the image has no noise, the result is not good enough yet.

Case 2: Printed Text with Small Font Size

Case 3: Printed Text Using MS Paint (no noise), font size: 48pt

At this font size, my own method produced around 98% accuracy on printed text. KhmerOCR also produces a good result here.

These tests cover only one font face, and the results are also meant as feedback to help the developers improve the system.
I believe TesseractOCR will do better as more training data is produced, but it will not reach the 100% accuracy that people expect. We need more involvement so that, together, we can bring this to at least 95% accuracy for people to use. That is why the OCR conference was formed.
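
Accuracy figures like the ones mentioned in this post can be compared more easily if everyone computes them the same way. The post does not say how the percentages were measured, so the following is only a minimal sketch under one common assumption: character accuracy is 1 minus the edit distance between the expected text and the OCR output, divided by the length of the expected text. The function names `levenshtein` and `character_accuracy` are illustrative and do not come from KhmerOCR or Tesseract.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, counting insertions,
    deletions and substitutions of single Unicode code points."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Assumed metric: 1 - edit_distance / len(reference), floored at 0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))
```

Note that this counts Unicode code points, so a Khmer consonant cluster made of several code points can contribute more than one error. Under this metric, a 95% target means at most 5 character errors per 100 characters of expected text.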

Thanks to Danh Hong and his team for this public initiative.
We look forward to seeing other people's results as well.


  1. The result after training: http://4.bp.blogspot.com/-BzBkDPDjFmE/VD0-RKDXwqI/AAAAAAAAFlI/GOaCxaVE8xk/s1600/Sokpongsametrey.png

    1. Wow, that's a good result. What needed to be trained: the font face or the misrecognized words?

  2. Would you mind sending the documents from Case 1? Were they scanned at 300 dpi?
