I've recently finished training tesseract 3.03-rc1 on the Georgian language, using tesstrain.sh and based off the files in the langdata repository. I created my own word list and bigrams list using Wikipedia.
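For readers who want to build similar lists, here is a minimal sketch of counting words and word bigrams from plain text. It is not the script used for this training; the function name, the sample sentences, and the choice to treat runs of Mkhedruli letters (U+10D0 to U+10FA) as words are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of Georgian Mkhedruli letters (U+10D0..U+10FA).
GEORGIAN_WORD = re.compile(r"[\u10D0-\u10FA]+")


def count_words_and_bigrams(lines):
    """Return (word_counts, bigram_counts) for an iterable of text lines."""
    words = Counter()
    bigrams = Counter()
    for line in lines:
        tokens = GEORGIAN_WORD.findall(line)
        words.update(tokens)
        # Pair each token with its successor on the same line.
        bigrams.update(zip(tokens, tokens[1:]))
    return words, bigrams


# Hypothetical sample input; real input would be text extracted from Wikipedia.
sample = ["ეს არის ტესტი", "ეს არის წიგნი (book)"]
word_counts, bigram_counts = count_words_and_bigrams(sample)
for word, n in word_counts.most_common(2):
    print(word, n)
```

Sorting the counters by frequency (as `most_common` does) gives the ordering that frequency-based word lists usually need; Latin-script tokens such as "book" are dropped by the regular expression.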
Performance is very good on high-quality scans with modern fonts, but it doesn't do very well on older documents; I'm not sure whether this is because of differences in the font, or because the synthetic images generated by the tesstrain.sh script don't give tesseract enough training in handling degraded images.
I've uploaded the traineddata file and all training files here: https://dl.dropboxusercontent.com/u/11840441/kat_train20150401.zip
I'm attaching a test image (a randomly-selected scan from Georgia's registry of corporations) and the output of running tesseract recognition on the test image. No pre-processing was done on the test image except to upsample it to 300dpi. The test image contains some Latin characters so I ran tesseract with the language selector "kat+eng".
The licensing for any documents to which I hold the copyright is the same as the tesseract source, i.e. the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0).
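For reference, the recognition run on the test image described above (language selector "kat+eng") could be sketched like this. The file names are hypothetical placeholders, and the snippet only builds and prints the command; running it assumes tesseract and both traineddata files are installed.

```python
import shlex


def tesseract_command(image_path, output_base, languages):
    """Build the argument list for a tesseract run with several languages."""
    # Tesseract accepts multiple languages joined with '+' in the -l option.
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# Hypothetical file names, for illustration only.
cmd = tesseract_command("test_image.png", "test_output", ["kat", "eng"])
print(shlex.join(cmd))

# To actually run it:
# import subprocess
# subprocess.run(cmd, check=True)
```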
| sventech | Apr 2 |
Cool! Good work. I hope that will help the others who have been asking about Georgian for a couple years. :-)
--Sven
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ded0dcf6-e050-450b-8bcd-17dde924aafe%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
“All that is gold does not glitter,
not all those who wander are lost;
the old that is strong does not wither,
deep roots are not reached by the frost.
From the ashes a fire shall be woken,
a light from the shadows shall spring;
renewed shall be blade that was broken,
the crownless again shall be king.”
| shree | Apr 2 |
Please see
It may be possible to do additional training using degraded versions of the 'synthetic' images, which may improve recognition of older documents.
ShreeDevi
______________________________ ______________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
| Derek | Apr 3 |
ShreeDevi,
Thanks for this -- I tried re-training tesseract with a range of exposure values passed to text2image, but didn't see improved results.
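As a rough illustration of what rendering at several exposure levels can look like, the sketch below builds one text2image command per exposure value. The exposure values, font name, and paths are hypothetical (the values actually tried are not stated here); the flags `--text`, `--outputbase`, `--font`, `--fonts_dir`, and `--exposure` are text2image options.

```python
def text2image_commands(training_text, font, fonts_dir, lang, exposures):
    """Build one text2image argument list per exposure level."""
    commands = []
    for exposure in exposures:
        # Follow the usual lang.fontname.expN naming for training images.
        outputbase = f"{lang}.{font.replace(' ', '_')}.exp{exposure}"
        commands.append([
            "text2image",
            f"--text={training_text}",
            f"--outputbase={outputbase}",
            f"--font={font}",
            f"--fonts_dir={fonts_dir}",
            f"--exposure={exposure}",
        ])
    return commands


# Hypothetical inputs, for illustration only.
cmds = text2image_commands(
    training_text="kat.training_text",
    font="DejaVu Sans",
    fonts_dir="/usr/share/fonts",
    lang="kat",
    exposures=[-1, 0, 1],
)
for cmd in cmds:
    print(" ".join(cmd))
```

Each command writes its own image and box file pair, so the different exposure levels can be fed into the rest of the training pipeline side by side.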
However, I did notice in the process that the x-heights for the document I was attempting to recognize were near the lower limit of what Tesseract can handle (~10px), so I doubled the image size. This resulted in much improved recognition; there are still errors, but fewer of them and they "make sense" now. Tesseract isn't able to segment the 5-column page layout very well, but otherwise I'm pretty happy with the results.
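The resizing step can be sketched as a small calculation: given a measured x-height, pick a whole-number scale factor that lifts it comfortably above the ~10px limit. The target of 20px and the example page dimensions are assumptions for illustration; the actual resampling would be done with an image tool of your choice.

```python
import math


def upscale_for_x_height(width, height, x_height_px, target_x_height_px=20):
    """Return (scale, new_width, new_height) so the x-height reaches the target."""
    if x_height_px <= 0:
        raise ValueError("x_height_px must be positive")
    # Only ever enlarge, and use a whole-number factor to keep resampling simple.
    scale = max(1, math.ceil(target_x_height_px / x_height_px))
    return scale, width * scale, height * scale


# Hypothetical page size in pixels, with a measured x-height of about 10px.
scale, new_w, new_h = upscale_for_x_height(2480, 3508, x_height_px=10)
print(scale, new_w, new_h)
```

With a 10px x-height and a 20px target this yields a factor of 2, which matches the doubling described above.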
Derek
| zdenop | Apr 3 |
Can you create a repository for your training files (on SourceForge or GitHub)?
Maybe with a detailed description of how you created them, so that other people can try to improve or extend them.
Zdenko
| Derek | Apr 4 |
Hi Zdenko,
Sure, no problem -- I've made all the files, along with instructions, available at https://github.com/ddohler/tesseract-georgian
Cheers,
Derek
| zdenop | Apr 4 |
Thanks. I put a link on the AddOn wiki.
Zdenko
| sibi kanagaraj | Apr 8 |
Excellent documentation.
A small correction to the documentation:
Here, under kat.wordlist.clean / kat.word.bigrams.clean, it says
<<Run
python count_stuff/word_counts.py
>> but the actual file name is wordcounts.py.
-Sibi