This year is the year of the Tesseract OCR engine in Cambodia: researchers everywhere are focusing on it. Tesseract OCR is an open-source OCR engine maintained by Google. People have been trying to train the engine on Khmer characters since 2009, but the results were not good enough to use.
Today, Danh Hong, the leader of his OCR project team and well known as the Khmer OS fonts designer, with Thim Rithy, the founder of moonOS (Unix Kernel OS), announced his results with an online tool, khmerocr.org, which lets people convert scanned documents into Khmer Unicode text.
Web-based interface: KhmerOCR.org
He is currently asking people to test the system and report errors so that it can be improved.
I ran some tests with the test files I used in my previous research. The results show that a lot of time and work is still needed to improve it.
All my test cases below use the font Khmer OS Content.
Case 1: Real Scanned Document (not much noise), font size: 32pt
The document is set in Khmer OS Content at 32pt and was scanned on an HP Scanjet G3110 at high resolution, so the image is clear. The result is not yet good enough.
Case 1: Scanned Document
Case 2: Printed Text Using MS Paint (no noise), font size: 11pt
I tried this more realistic document because its font size is close to what people actually use.
Even though it has no noise, the result is not good enough yet.
Case 2: Printed Text with Small Font Size
Case 3: Printed Text Using MS Paint (no noise), font size: 48pt
At this font size, my own method for printed text could produce around 98% accuracy. Here the result is also good.
These tests cover only one font face, and the results are also meant to help the developers improve the system.
I believe Tesseract OCR will do better as more training data is created, but it will not produce the 100% accuracy people expect. We need more involvement to bring this to at least 95% accuracy together so that people can use it. That's why the OCR conference was formed.
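For anyone who wants to report comparable numbers, here is a minimal sketch of one common way to score OCR output: character accuracy based on edit distance between the correct text and the OCR result. This is an assumption on my side about how to measure it, not the method used by KhmerOCR.org, and the function names `edit_distance` and `character_accuracy` are only illustrative. It counts Unicode code points, so each Khmer vowel sign or subscript counts as its own character.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Return 1 - (edit distance / reference length), clamped to [0, 1]."""
    # Normalize both strings so equivalent Unicode sequences compare equal.
    ref = unicodedata.normalize("NFC", reference)
    out = unicodedata.normalize("NFC", ocr_output)
    if not ref:
        return 1.0 if not out else 0.0
    return max(0.0, 1.0 - edit_distance(ref, out) / len(ref))
```

To use it, type the correct text of the test page by hand, paste the OCR output next to it, and call `character_accuracy(correct_text, ocr_text)`. A value of 0.95 or more would meet the 95% target mentioned above.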
Thanks to Danh Hong and his team for this public initiative.
We are waiting for other people's results as well.