Thursday, 9 April 2015

New Georgian (kartuli ena) traineddata for Tesseract


I've recently finished training tesseract 3.03-rc1 on the Georgian language, using tesstrain.sh and based off the files in the langdata repository. I created my own word list and bigrams list using Wikipedia.
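For readers who want to build similar lists, here is a minimal sketch of counting words and bigrams from plain text. It is not the script used for this training run; the function name and the sample sentence are made up for illustration, and it assumes the Wikipedia text has already been extracted to plain Unicode text.

```python
import re
from collections import Counter


def count_words_and_bigrams(text):
    """Return (word_counts, bigram_counts) for a plain-text string.

    Hypothetical helper for illustration only.
    """
    # Runs of letters only: \w minus digits and underscore.
    # Python 3 str patterns are Unicode-aware, so Georgian letters match.
    words = re.findall(r"[^\W\d_]+", text)
    word_counts = Counter(words)
    # Adjacent word pairs; this simple version also pairs words
    # across sentence boundaries.
    bigram_counts = Counter(zip(words, words[1:]))
    return word_counts, bigram_counts


sample = "ქართული ენა არის ქართული ენა"
word_counts, bigram_counts = count_words_and_bigrams(sample)
for word, count in word_counts.most_common():
    print(word, count)
for (first, second), count in bigram_counts.most_common():
    print(first, second, count)
```

Sorting by frequency with `most_common()` makes it easy to keep only the top entries when writing out the final lists.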

Performance is very good on high-quality scans with modern fonts, but it doesn't do very well on older documents; I'm not sure whether this is because of differences in the font, or because the synthetic images generated by the tesstrain.sh script don't give tesseract enough training in handling degraded images.

I've uploaded the traineddata file and all training files here: https://dl.dropboxusercontent.com/u/11840441/kat_train20150401.zip

I'm attaching a test image (a randomly-selected scan from Georgia's registry of corporations) and the output of running tesseract recognition on the test image. No pre-processing was done on the test image except to upsample it to 300dpi. The test image contains some Latin characters so I ran tesseract with the language selector "kat+eng".
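As a rough sketch of the invocation described above, the snippet below assembles a tesseract command line with a combined language selector. The exact command used for the attached output is not given in this post, so the output base name here is an assumption chosen to match the attachment names, and the helper function is hypothetical.

```python
def build_tesseract_command(image_path, output_base, languages):
    """Build an argument list for the tesseract CLI (hypothetical helper).

    Multiple languages are joined with "+", e.g. "kat+eng".
    """
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# File names assumed from the attachments; tesseract appends ".txt"
# to the output base for plain-text output.
command = build_tesseract_command("NIKA_28.png", "NIKA_28", ["kat", "eng"])
print(" ".join(command))  # tesseract NIKA_28.png NIKA_28 -l kat+eng

# To actually run it (needs tesseract plus kat.traineddata and
# eng.traineddata in the tessdata directory):
# import subprocess
# subprocess.run(command, check=True)
```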

The licensing for any documents to which I hold the copyright is the same as the tesseract source, i.e. the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0).
Attachments (2)
NIKA_28.txt (4 KB)
NIKA_28.png (696 KB)
sventech
2 Apr.
Cool! Good work. I hope that will help the others who have been asking about Georgian for a couple years. :-)
--Sven

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ded0dcf6-e050-450b-8bcd-17dde924aafe%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



shree
2 Apr.
Please see 

It may be possible to do additional training using degraded versions of the 'synthetic' images, which may improve recognition of older documents.

ShreeDevi

Derek
3 Apr.
ShreeDevi,

Thanks for this -- I tried re-training tesseract with a range of exposure values passed to text2image, but didn't see improved results.
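For anyone wanting to repeat this kind of experiment, here is a sketch of generating one text2image command per exposure level. The exposure values actually tried are not stated in this thread, so the range below is only an example; the font name, training-text path, and output naming are placeholders, and the helper is hypothetical. The flags shown are the ones I believe text2image accepts; check `text2image --help` for your version.

```python
def text2image_commands(training_text, font, exposures, lang="kat"):
    """Return one text2image argument list per exposure level.

    Hypothetical helper; paths and font are placeholders.
    """
    commands = []
    font_tag = font.replace(" ", "_")
    for exposure in exposures:
        # Output base in the lang.font.expN style used for training images.
        outputbase = f"{lang}.{font_tag}.exp{exposure}"
        commands.append([
            "text2image",
            f"--text={training_text}",
            f"--outputbase={outputbase}",
            f"--font={font}",
            f"--exposure={exposure}",
        ])
    return commands


# Example range only; lower and higher values render lighter or darker text.
commands = text2image_commands("kat.training_text", "DejaVu Sans", range(-1, 2))
for command in commands:
    print(" ".join(command))
```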

However, I did notice in the process that the x-heights for the document I was attempting to recognize were near the lower limit of what Tesseract can handle (~10px), so I doubled the image size. This resulted in much improved recognition; there are still errors, but fewer of them and they "make sense" now. Tesseract isn't able to segment the 5-column page layout very well, but otherwise I'm pretty happy with the results.
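The arithmetic behind the resize can be sketched as follows. The 10px figure comes from the measurement above; the 20px target and the page dimensions are assumptions picked to match a 2x enlargement, and both helpers are hypothetical. The resampling itself can be done with any image tool.

```python
def scale_for_xheight(measured_xheight_px, target_xheight_px):
    """Scale factor that brings the measured x-height up to the target."""
    return target_xheight_px / measured_xheight_px


def scaled_size(width, height, factor):
    """New pixel dimensions after scaling, rounded to whole pixels."""
    return round(width * factor), round(height * factor)


# Measured x-height of about 10px; 20px is an assumed target.
factor = scale_for_xheight(10, 20)
print(factor)  # 2.0

# Page dimensions below are illustrative only.
print(scaled_size(1240, 1754, factor))  # (2480, 3508)
```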

Derek

zdenop
3 Apr.
Can you create a repository for your training files (on SourceForge or GitHub)?

Maybe with a detailed description of how you created them (so that other people can potentially try to improve or extend them).

Zdenko

Derek
4 Apr.
Hi Zdenko,

Sure, no problem -- I've made all the files, along with instructions, available at https://github.com/ddohler/tesseract-georgian

Cheers,
Derek

zdenop
4 Apr.
Thanks. I put a link on the AddOn wiki.

Zdenko

sibi kanagaraj
8 Apr.
Hi Derek,

Excellent documentation.

A small correction to the documentation.

Here, under kat.wordlist.clean / kat.word.bigrams.clean:

<<Run python count_stuff/word_counts.py>>

but the actual file name is wordcounts.py.

-Sibi
