how to use your serial project to realize a ocr for chinese #14

Open
wanghaisheng opened this issue Apr 6, 2016 · 5 comments
Comments

@wanghaisheng

I have a lot of XPS/PDF files that can be converted to JPEG files.
1. Do I need to generate millions of Chinese characters, like your "datagen_initio"?
2. What about fonts and encoding for Chinese characters ("Mallicodes")?
3. Do I need to prepare box files generated by antanci_segmenter / OCR Segmenter?

@ChillarAnand
Contributor

@rakeshvar It would be great if you could list the steps that need to be followed to extend banti to other languages.

@rakeshvar
Collaborator

@wanghaisheng
You can probably find many implementations of Chinese OCR elsewhere on the web; it is a problem that has received much more attention than OCR for Indian languages. But if you want to follow the same approach, here is a brief outline.

  1. Generate a lot of images to train a CNN with, and then train the CNN.
  2. Redesign the segmentation part (page.py) to better suit Chinese. (You should be able to find Chinese text segmenters online too.)
  3. Specify an n-gram dictionary of counts (build123grams.py).
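
As a rough illustration of step 3, here is a minimal sketch of counting character 1-, 2-, and 3-grams from a text corpus. It is not the code in build123grams.py, and the input and output formats that script expects are not described in this thread. The function name `count_ngrams`, the `max_n` parameter, and the two-line sample corpus are all hypothetical, chosen only to show the idea.

```python
from collections import Counter


def count_ngrams(lines, max_n=3):
    """Count character n-grams (n = 1..max_n) over an iterable of text lines.

    Hypothetical helper for illustration; not banti's actual API.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        # Drop whitespace so n-grams never span spaces or newlines.
        chars = [ch for ch in line if not ch.isspace()]
        for n in range(1, max_n + 1):
            # Slide a window of width n over the characters of this line.
            for i in range(len(chars) - n + 1):
                counts[n]["".join(chars[i:i + n])] += 1
    return counts


# Tiny made-up corpus; a real dictionary would need a large text collection.
corpus = ["中文识别", "中文文本"]
grams = count_ngrams(corpus)
print(grams[2]["中文"])  # bigram count for "中文"
```

Chinese text has no spaces between words, so counting at the character level avoids the need for a word segmenter. Whether character-level counts are the right unit for a given recognizer is a design choice that would have to be checked against the rest of the pipeline.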

@rakeshvar
Collaborator

@ChillarAnand
I am not sure how well suited the banti framework is to extension. It can be done, no doubt. I am thinking of the chamanti framework, which is much easier to extend. You might be interested in working on that. I can post guidelines for that.

What do you think is the best way to make this collaborative with a minimal amount of work on my side (I really cannot spend much time on these things)? A github.io page? A Google group? Ideally there would be a post, plus room for discussion and questions. Please do suggest. Thanks.

@ChillarAnand
Contributor

@rakeshvar Should we use the GitHub issue tracker itself for discussion?

@wanghaisheng
Author

A blog post would be best.

3 participants