The participants come from various stakeholders:
- Universities: ITC, RUPP, NIPTICT
- NGO: OI
- Private Sector/Individual: Myself
Now it's time to be more open. I'll post more details later, but first, if you are interested in presenting Khmer OCR research or a product, or would simply like to attend the presentations by various researchers, please save the date: Tuesday, 16th of September 2014. The venue will be announced soon.
The team will send invitations to some individuals and groups whose work on OCR the team is aware of. An official announcement will follow soon.
The team is called: Khmer Natural Language Processing Consortium (Khmer NLP Consortium).
It's all about open source, open data, and open ideas and methods to make a difference.
Are you interested in joining the conference? If you have anything that could make a difference for our community, please come and join us.
Updated 08/09
- The conference has been moved to 1st of October; see this post
I believe the best approach would be to use Tesseract instead of building something completely new. The software is open source and was originally built by a strong team. From what I see, you just need to add a dictionary and the Unicode characters for the alphabet, which look like they are available, and then it's just a matter of "training it", so to speak.
A better way to utilize the team would be to create a Khmer GUI for the engine, or to look into ways to improve its ability to read text lines from an image that may be tilted by a few degrees or have a discolored background.
I think the end result should be an online tool that accepts multi-file png/jpg/tiff submissions and returns a text file, at least for starters, to get it out there and then move forward after seeing how it's being used, etc. Just my penny thrown in, but I am sure whoever is running this has already considered this approach.
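To make the batch idea concrete, here is a minimal sketch of the "several images in, text files out" step, calling the Tesseract command-line tool from Python. It assumes the `tesseract` executable is on the PATH and that Khmer language data is installed under the code `khm`; the function names `build_command` and `ocr_batch` are made up for illustration, and the web upload part is left out entirely.

```python
import subprocess
from pathlib import Path

# File types the hypothetical tool would accept.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_command(image, out_dir, lang="khm"):
    """Return the tesseract CLI call for one image.

    The CLI takes an output *base* name and appends ".txt" itself.
    lang="khm" is an assumption: it requires Khmer traineddata to be installed.
    """
    image = Path(image)
    out_base = Path(out_dir) / image.stem
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_batch(images, out_dir, lang="khm"):
    """Run tesseract on each supported image; return the expected .txt paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for image in map(Path, images):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip uploads that are not png/jpg/tiff
        subprocess.run(build_command(image, out_dir, lang), check=True)
        results.append(out_dir / (image.stem + ".txt"))
    return results
```

Something like `ocr_batch(["page1.png", "page2.tiff"], "out")` would then leave `out/page1.txt` and `out/page2.txt` behind, which an online tool could zip up or concatenate before returning them.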
You are right. This team is focusing on Tesseract, and we will work on improving the steps both before and after the Tesseract result. We will find out what needs to be improved in Tesseract in order to have a good OCR for Khmer.
If you are interested in sharing your ideas, please come to the conference; note that the date has changed to 1st of October.
Hi,
I have been living in PP for a few years, though I don't speak Khmer. I develop in C#, JavaScript or PHP, while Tesseract is in C++. I will think about it if I have free time, though as I said I don't speak Khmer, which will likely be the language spoken. However, I found a solution to a problem that one of the previous teams, in one of the earlier posts on this website, was unable to solve: the inability of Tesseract to process skewed lines of text in images.
This library, http://felix.abecassis.me/2011/09/opencv-detect-skew-angle/, will compute the number of degrees by which the text is tilted in each image, and then it's just a matter of rotating the image by that number of degrees, which is the easy part.
Thanks.
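For readers who want to see the skew idea in code, here is a rough sketch. It is not the code from the linked page: it swaps in a simple projection-profile search, written in plain Python so it runs without OpenCV. It assumes the image has already been binarized and reduced to a list of `(x, y)` coordinates of dark pixels; the names `profile_score` and `estimate_skew` and the default search range are made up for illustration.

```python
import math
from collections import Counter


def profile_score(points, angle_deg):
    """Score how sharply the points fall into rows after undoing a tilt.

    Each point is rotated by -angle_deg and binned by its rounded row.
    When the angle matches the real skew, every text line collapses into
    very few rows, so the sum of squared bin counts becomes large.
    """
    t = math.radians(angle_deg)
    sin_t, cos_t = math.sin(t), math.cos(t)
    rows = Counter(round(y * cos_t - x * sin_t) for x, y in points)
    return sum(count * count for count in rows.values())


def estimate_skew(points, max_angle=15.0, step=0.5):
    """Try angles in [-max_angle, max_angle] and return the best one in degrees.

    A positive result means y grows with x along a text line (image
    coordinates, y pointing down).
    """
    n = int(round(max_angle / step))
    candidates = [i * step for i in range(-n, n + 1)]
    return max(candidates, key=lambda a: profile_score(points, a))
```

Once the angle is known, the rotation itself is the easy part mentioned above; with OpenCV it would typically be done with `cv2.getRotationMatrix2D` and `cv2.warpAffine` before the image is handed to Tesseract. A finer `step` gives a more precise angle at the cost of more candidates to score.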