Tesseract OCR examples: New Georgian (kartuli ena) traineddata for Tesseract

I've recently finished training Tesseract 3.03-rc1 on the Georgian language, using tesstrain.sh and starting from the files in the langdata repository. I built my own word list and bigram list from Wikipedia.
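For anyone wanting to build similar lists for another language, here is a minimal sketch of counting words and word bigrams in plain text. It is not the script used for this traineddata; the function name and the sample sentence are made up for illustration, and it assumes the Wikipedia text has already been extracted to plain UTF-8.

```python
import re
from collections import Counter

# Georgian (Mkhedruli) letters sit at U+10D0..U+10FA in the Unicode Georgian block.
GEORGIAN_WORD = re.compile(r"[\u10D0-\u10FA]+")


def count_words_and_bigrams(text):
    """Count Georgian words and adjacent word pairs, line by line."""
    unigrams = Counter()
    bigrams = Counter()
    for line in text.splitlines():
        words = GEORGIAN_WORD.findall(line)
        unigrams.update(words)
        # Pair each word with its successor; pairs never span a line break.
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


# Hypothetical sample text, only to show the call.
sample = "ქართული ენა ქართული ენა არის"
unigrams, bigrams = count_words_and_bigrams(sample)
for word, count in unigrams.most_common():
    print(count, word)
for (first, second), count in bigrams.most_common():
    print(count, first, second)
```

Note that this simple version lets bigrams cross punctuation within a line, so a real word list would probably need sentence splitting and some frequency cut-off as well.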

Performance is very good on high-quality scans with modern fonts, but it doesn't do very well on older documents; I'm not sure whether this is because of differences in the font, or because the synthetic images generated by the tesstrain.sh script don't give tesseract enough training in handling degraded images.

I've uploaded the traineddata file and all training files here: https://dl.dropboxusercontent.com/u/11840441/kat_train20150401.zip

I'm attaching a test image (a randomly-selected scan from Georgia's registry of corporations) and the output of running tesseract recognition on the test image. No pre-processing was done on the test image except to upsample it to 300dpi. The test image contains some Latin characters so I ran tesseract with the language selector "kat+eng".
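As a rough sketch of the invocation described above, the following builds the command line for a multi-language run. It assumes the standard `tesseract imagename outputbase -l lang` form and that kat.traineddata has been copied into the tessdata directory; the helper name is made up for illustration.

```python
def build_tesseract_command(image_path, output_base, languages):
    """Return the argv list for a tesseract run with one or more languages."""
    # Multiple languages are joined with "+", e.g. "kat+eng".
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


cmd = build_tesseract_command("NIKA_28.png", "NIKA_28", ["kat", "eng"])
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd) on a machine where tesseract is installed; the recognized text is then written to NIKA_28.txt.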

The licensing for any documents to which I hold the copyright is the same as the tesseract source, i.e. the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0).

Attachments (2):

NIKA_28.txt (4 KB)

NIKA_28.png (696 KB)

sventech

Apr 2

Cool! Good work. I hope that will help the others who have been asking about Georgian for a couple years. :-)

--Sven

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ded0dcf6-e050-450b-8bcd-17dde924aafe%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

“All that is gold does not glitter,
not all those who wander are lost;
the old that is strong does not wither,
deep roots are not reached by the frost.
From the ashes a fire shall be woken,
a light from the shadows shall spring;
renewed shall be blade that was broken,
the crownless again shall be king.”

shree

Apr 2

Please see

https://code.google.com/p/tesseract-ocr/source/browse/training/degradeimage.h

https://code.google.com/p/tesseract-ocr/source/browse/training/degradeimage.cpp

It may be possible to do additional training using degraded versions of the 'synthetic' images, which may improve recognition of older documents.
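One way to try this without touching the C++ code might be to render the training text at several exposure levels with text2image. The sketch below only builds the command lines; it assumes the --text, --outputbase, --font, --fonts_dir and --exposure flags of text2image are available in your build, and the font name, fonts directory and training text path are placeholders, not values from this thread.

```python
def text2image_commands(training_text, font, fonts_dir, lang, exposures):
    """Build one text2image argv list per exposure level."""
    commands = []
    for exposure in exposures:
        # Output name pattern lang.Font_Name.expN (an assumed convention).
        output_base = f"{lang}.{font.replace(' ', '_')}.exp{exposure}"
        commands.append([
            "text2image",
            f"--text={training_text}",
            f"--outputbase={output_base}",
            f"--font={font}",
            f"--fonts_dir={fonts_dir}",
            f"--exposure={exposure}",
        ])
    return commands


# Placeholder inputs: substitute your own font and paths.
for cmd in text2image_commands(
    "langdata/kat/kat.training_text", "DejaVu Sans", "/usr/share/fonts", "kat", [-1, 0, 1]
):
    print(" ".join(cmd))
```

Each command should produce an image and box file pair that can go through the rest of the training steps; whether the extra exposures actually help on old documents would have to be checked on real scans.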

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Derek

Apr 3

ShreeDevi,

Thanks for this -- I tried re-training tesseract with a range of exposure values passed to text2image, but didn't see improved results.

However, I did notice in the process that the x-heights for the document I was attempting to recognize were near the lower limit of what Tesseract can handle (~10px), so I doubled the image size. This resulted in much improved recognition; there are still errors, but fewer of them and they "make sense" now. Tesseract isn't able to segment the 5-column page layout very well, but otherwise I'm pretty happy with the results.
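To make the arithmetic concrete, here is a small sketch for picking an integer upscale factor from a measured x-height. The 20px target is only an assumption based on doubling from roughly 10px, as described above; it is not a documented Tesseract threshold, and the function names and page size are made up for illustration.

```python
import math


def upscale_factor(x_height_px, target_px=20):
    """Smallest integer factor that brings the x-height up to target_px."""
    if x_height_px <= 0:
        raise ValueError("x-height must be positive")
    # Never shrink: an image that is already large enough keeps factor 1.
    return max(1, math.ceil(target_px / x_height_px))


def scaled_size(width, height, factor):
    """Pixel dimensions of the image after resizing by an integer factor."""
    return width * factor, height * factor


factor = upscale_factor(10)
print(factor, scaled_size(1240, 1754, factor))
```

The actual resize can then be done with any image tool before handing the file to tesseract.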

Derek

zdenop

Apr 3

Can you create a repository for your training files (on SourceForge or GitHub)?

Maybe with a detailed description of how you created them, so that other people can try to improve or extend them.

Zdenko

Derek

Apr 4

Hi Zdenko,

Sure, no problem -- I've made all the files, along with instructions, available at https://github.com/ddohler/tesseract-georgian

Cheers,

Derek

zdenop

Apr 4

Thanks. I've added a link to the AddOn wiki.

Zdenko

sibi kanagaraj

Apr 8

Hi Derek,

Excellent documentation.

One small correction to the documentation.

Here, under kat.wordlist.clean / kat.word.bigrams.clean:

<<Run python count_stuff/word_counts.py>>

but the actual file name is wordcounts.py.

-Sibi

Thursday, April 9, 2015
