This year is the year of the Tesseract OCR engine in Cambodia: researchers everywhere are focusing on it. Tesseract is an open-source OCR engine maintained by Google. People have been trying to train the engine on Khmer characters since 2009, but the results were not good enough to use.
Today Danh Hong, the team leader of his OCR project and well known as the designer of the Khmer OS fonts, together with Thim Rithy, the founder of moonOS (Unix Kernel OS), announced his result as an online tool, khmerocr.org, which lets people convert a scanned document into Khmer Unicode text.
Web-based interface: KhmerOCR.org
He is currently asking people to test the tool and report errors so that the system can be improved.
I ran some tests with the test files I used in my previous research. The results show that a lot of time and work is still needed to improve it.
All of my test cases below use the font Khmer OS Content.
Case 1: Real Scanned Document (not much noise), font size: 32pt
The document is written in Khmer OS Content at 32pt and was scanned on an HP Scanjet G3110 at a high resolution, so the image is clear enough. The result is not yet good enough.
Case 1: Scanned Document
Case 2: Printed Text Using MS Paint (no noise), font size: 11pt
I took this case seriously because the font size is close to what people actually use.
Even though the image has no noise, the result is not good enough yet.
Case 2: Printed Text with Small Font Size
Case 3: Printed Text Using MS Paint (no noise), font size: 48pt
With this font size, my own method could produce around 98% accuracy on printed text. Here the tool also produces a good result.
These tests cover only one font face, and they are also a result that the developers can use to improve the system.
I believe Tesseract OCR will do better as more training data is made, but it will not produce the 100% accuracy that people expect. We need more involvement to bring this to at least 95% accuracy together so that people can use it. That is why the OCR conference was formed.
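For anyone who wants to report results in a comparable way, here is a minimal sketch of one common way to measure accuracy: character accuracy based on edit distance between the expected text and the OCR output. This is an assumption for illustration and not necessarily how the figures in this post were measured. The function names are hypothetical, and the code counts Unicode code points rather than visual Khmer clusters.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # character missing from OCR output
                curr[j - 1] + 1,                  # extra character in OCR output
                prev[j - 1] + substitution_cost,  # match or wrong character
            ))
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - (edit distance / reference length), clamped to [0, 1]."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # Assumed example: the expected text versus an OCR output with one vowel dropped.
    expected = "កម្ពុជា"
    ocr_output = "កម្ពុជ"
    print(f"character accuracy: {character_accuracy(expected, ocr_output):.2%}")
```

Because Khmer subscripts and vowels are separate code points, one visually wrong cluster can count as several errors here. Counting by cluster or by word would give different numbers, so it helps to say which unit was used when sharing a percentage.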
Thanks to Danh Hong and his team for this public initiative.
We are waiting for other people's results as well.
The result after training: http://4.bp.blogspot.com/-BzBkDPDjFmE/VD0-RKDXwqI/AAAAAAAAFlI/GOaCxaVE8xk/s1600/Sokpongsametrey.png
Wow, that's a good result. What needs to be trained? The font face or the error words?
Would you mind sending the documents from Case 1? Were they scanned at 300 dpi?