Thursday, 9 April 2015

Cast word confidence success rate ?

Hi. I am getting the word confidences in the HTML output of the Tesseract
OCR engine, but the confidences are negative numbers. How can I get a
success rate for a word from negative confidence values?


Thanks
Dmitri Silaev
06/08/2011
Use this formula:
confidence = min(100, max(0, 100 + 5*certainty))
where "confidence" is the value you need and "certainty" is the value
returned by Tess.
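
As a minimal sketch, the formula above can be written as a small Python function. The function name is hypothetical and is not part of Tesseract's API; only the arithmetic comes from the post.

```python
def certainty_to_confidence(certainty: float) -> float:
    """Map a Tesseract word certainty to a score in the [0, 100] range."""
    # 100 + 5*certainty: a certainty of 0 maps to 100, and a certainty
    # of -20 or lower maps to 0. The min/max pair clamps everything else.
    return min(100.0, max(0.0, 100.0 + 5.0 * certainty))


# A few sample certainties and the scores they produce.
for certainty in (0.0, -2.5, -10.0, -25.0):
    print(certainty, certainty_to_confidence(certainty))
```

Because of the clamping, any certainty at or above 0 yields 100 and any certainty at or below -20 yields 0.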
Warm regards,
Dmitri Silaev
www.CustomOCR.com
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to tesser...@googlegroups.com
> To unsubscribe from this group, send email to
> tesseract-oc...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
Andy Hotmail
06/08/2011
  
Hi Dmitri,
You kindly added me to the group when it wouldn't let me subscribe. It now
won't let me unsubscribe, so if you could unsubscribe me, please, it would be
much appreciated.
Thank You
Andy Syme
Dmitri Silaev
06/08/2011
  
Hi Andy,
Unfortunately I can't help you - I'm not in charge of moderating
this forum. Please ask the official moderators about this.
Warm regards,
Dmitri Silaev
www.CustomOCR.com
sventech
06/08/2011
This should help -- have you tried these methods?
http://groups.google.com/support/bin/answer.py?answer=46608
--Sven
--
``All that is gold does not glitter,
  not all those who wander are lost;
the old that is strong does not wither,
  deep roots are not reached by the frost.
From the ashes a fire shall be woken,
  a light from the shadows shall spring;
renewed shall be blade that was broken,
  the crownless again shall be king.”
Yunus Emre Cavusoglu
08/08/2011
 
Thank you very much Dmitri
Gunasekaran Velu
8 Apr.
Hi Dmitri

Does your formula apply only to negative confidence scores, or to all values?

I am getting a confidence score of 215 (a positive value) for "Name". Is that correct, or do I need to do some calculation on it?

Looking forward to your reply.


Regards
Guna
Dmitri Silaev
8 Apr.
Re: [tesseract-ocr] Re: Cast word confidence success rate ?
  
It seems you're confusing "certainty" and "confidence" here. Please pay close attention to what you're writing or rephrase your question. The formula itself allows no values out of the [0, 100] range.

Best regards,
Dmitri Silaev
www.CustomOCR.com




On Wed, Apr 8, 2015 at 8:37 AM, Gunasekaran Velu <mail2...@gmail.com> wrote:
Hi Dmitri

Does your formula apply only to negative confidence scores, or to all values?

I am getting a confidence score of 215 for "Name". Is that correct, or do I need to do some calculation on it?
Gunasekaran Velu
8 Apr.
Re: [tesseract-ocr] Re: Cast word confidence success rate ?
 
Really sorry for the mistake.
I am getting a certainty value of 215 (a positive value) from Tesseract for the text "Name".

Is your formula applicable to this certainty value?

Kindly do the needful.

Regards
Guna

New Georgian (kartuli ena) traineddata for Tesseract


I've recently finished training tesseract 3.03-rc1 on the Georgian language, using tesstrain.sh and based off the files in the langdata repository. I created my own word list and bigrams list using Wikipedia.

Performance is very good on high-quality scans with modern fonts, but it doesn't do very well on older documents; I'm not sure whether this is because of differences in the font, or because the synthetic images generated by the tesstrain.sh script don't give tesseract enough training in handling degraded images.

I've uploaded the traineddata file and all training files here: https://dl.dropboxusercontent.com/u/11840441/kat_train20150401.zip

I'm attaching a test image (a randomly-selected scan from Georgia's registry of corporations) and the output of running tesseract recognition on the test image. No pre-processing was done on the test image except to upsample it to 300dpi. The test image contains some Latin characters so I ran tesseract with the language selector "kat+eng".

The licensing for any documents to which I hold the copyright is the same as the tesseract source, i.e. the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0).
Attachments (2)
NIKA_28.txt (4 KB)
NIKA_28.png (696 KB)
sventech
2 Apr.
Cool! Good work. I hope that will help the others who have been asking about Georgian for a couple years. :-)
--Sven

shree
2 Apr.
Please see 

It may be possible to do additional training using degraded versions of the 'synthetic' images, which may improve recognition of older documents.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Derek
3 Apr.
ShreeDevi,

Thanks for this -- I tried re-training tesseract with a range of exposure values passed to text2image, but didn't see improved results.

However, I did notice in the process that the x-heights for the document I was attempting to recognize were near the lower limit of what Tesseract can handle (~10px), so I doubled the image size. This resulted in much improved recognition; there are still errors, but fewer of them and they "make sense" now. Tesseract isn't able to segment the 5-column page layout very well, but otherwise I'm pretty happy with the results.
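
The resizing decision described above can be sketched as a small calculation. This is only an illustration: the function name and the 20 px target (twice the ~10 px lower limit mentioned in the post) are assumptions, not values taken from Tesseract.

```python
import math


def upscale_factor(x_height_px: float, target_px: float = 20.0) -> int:
    """Smallest integer scale factor that brings the x-height up to target_px.

    target_px is an assumed working value, not a documented Tesseract limit.
    """
    if x_height_px <= 0:
        raise ValueError("x-height must be positive")
    # Never shrink the image: a factor below 1 is rounded up to 1.
    return max(1, math.ceil(target_px / x_height_px))


# An x-height of about 10 px calls for doubling the image size.
print(upscale_factor(10))
```

Measuring the x-height itself (for example from the bounding boxes of lowercase letters) is left out of this sketch.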

Derek

zdenop
3 Apr.
Can you create a repository for your training (on SourceForge or GitHub)?

Maybe with a detailed description of how you created it (so that other people can try to improve/extend it).


Zdenko

Derek
4 Apr.
Hi Zdenko,

Sure, no problem -- I've made all the files available, along with instructions, at https://github.com/ddohler/tesseract-georgian

Cheers,
Derek

zdenop
4 Apr.
Thanks. I put a link on the AddOn wiki.

Zdenko

sibi kanagaraj
8 Apr.
Hi Derek,

Excellent documentation.

A small correction to the documentation.

Here, under kat.wordlist.clean / kat.word.bigrams.clean:

<<Run python count_stuff/word_counts.py>>

but the actual file name is wordcounts.py.

-Sibi

Preprocessing - detailed cropping

Hello,

I'm trying to recognize the machine readable part of a passport. (see the last line in this picture: http://s.hswstatic.com/gif/passport-11.jpg )

I'm using Tesseract on Android (tess-two) and take the picture with a 5 Mpix mobile camera. Unfortunately, the accuracy is not satisfactory. To improve recognition, I have tried cropping the picture and retraining Tesseract for the font used in passports (OCR-B). Both raise accuracy, but still not to an acceptable level.
Here is a typical cropped picture I hand to Tesseract to perform OCR:

The binarized picture created by Tess for the actual recognition looks like this:


This is what Tesseract recognizes:

  09 1 M 1 907 1 8  F8 F857<4 < W<B<O <UME
  QVWBBENO W JMGHJ <RBP6W9BQR ED



I figured out that the thin line at the bottom is extremely distracting to Tesseract. If I cut off the line manually and run OCR, the results are perfectly fine and all characters are recognized.
My question is: how can I find and get rid of that line automatically if it is in the cropped picture? This has to be done on an Android phone.
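
One possible approach, sketched here in Python for brevity (on Android the same logic would need porting to Java), is a row projection on the binarized image: rows that are almost entirely dark and form a run only a few pixels thick are treated as the ruled line and blanked. The function name and both thresholds are assumptions for illustration and would need tuning on real passport crops.

```python
def remove_thin_lines(rows, min_fill=0.8, max_thickness=3):
    """Blank out thin runs of rows that are almost entirely dark.

    rows: list of equal-length lists, 1 = dark (ink) pixel, 0 = background.
    min_fill and max_thickness are assumed values, not tuned ones.
    """
    if not rows:
        return []
    width = len(rows[0])
    # A row is a line candidate when most of its pixels are dark.
    is_line_row = [sum(row) / width >= min_fill for row in rows]
    cleaned = [list(row) for row in rows]  # work on a copy
    y = 0
    while y < len(rows):
        if not is_line_row[y]:
            y += 1
            continue
        start = y
        while y < len(rows) and is_line_row[y]:
            y += 1
        # Erase only runs thin enough to be a ruled line, not a block of ink.
        if y - start <= max_thickness:
            for line_y in range(start, y):
                cleaned[line_y] = [0] * width
    return cleaned
```

Text rows normally have a much lower dark-pixel fraction than a ruled line, which is what the `min_fill` threshold relies on. A line that is skewed or does not span most of the crop would not be caught by this sketch.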

Any help will be appreciated!
Mirko

Problem with recognition of numbers 3 and 8

Hi, I'm having a problem with recognition of an invoice image: most of the 8 characters are being read as 3s.

Attached is the image I'm using.

I have tried different page segmentation modes (PSM) and some basic configuration options (resolution, not loading the dawgs).

Any help is appreciated.

Attachments (1)
test1.tif (177 KB)
Dmitri Silaev
24 Feb.
You need upscaling, then a bit of blurring, and it should work.
For upscaling I personally tried Lanczos with a factor of 3x. This eliminates most of the "8 vs. 3" errors. Don't forget that your source TIFF is BW (2 colors), so you have to save the upscaled result as, e.g., a 24-bit PNG.
For blurring I used FastStone Image Viewer's Blur with a parameter of 14. If you want to use ImageMagick, I don't know exactly how that parameter relates to Gaussian blur sigma; you have to experiment.
Then a standard command line for Tesseract works well. At least there are no more "8 vs. 3" errors.
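
For anyone trying the ImageMagick route, a rough sketch of the two steps is below. It only builds and runs an ImageMagick `convert` command line from Python. The function names and the sigma of 1.0 are assumptions: the post does not say which sigma corresponds to the FastStone setting, so that value has to be found by experiment. On ImageMagick 7 the executable is usually `magick` rather than `convert`.

```python
import subprocess


def build_convert_command(src, dst, scale=3, sigma=1.0):
    """Build an ImageMagick command: Lanczos upscale, blur, 24-bit PNG output.

    sigma is an assumed starting point, not a value from the original post.
    """
    return [
        "convert", src,
        "-filter", "Lanczos",          # resampling filter used by -resize
        "-resize", f"{scale * 100}%",  # 3x upscaling by default
        "-blur", f"0x{sigma}",         # radius 0 lets ImageMagick pick it
        f"PNG24:{dst}",                # force 24-bit RGB PNG output
    ]


def upscale_and_blur(src, dst, scale=3, sigma=1.0):
    """Run the command; requires ImageMagick to be installed."""
    subprocess.run(build_convert_command(src, dst, scale, sigma), check=True)


# Show the command that would be run for the attached image.
print(" ".join(build_convert_command("test1.tif", "test1_up.png")))
```

Sweeping `sigma` over a few values and comparing the OCR output is probably the quickest way to find a workable setting.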

Best regards,
Dmitri Silaev
www.CustomOCR.com



Andy Brandt
20:06 (3 hours ago)
I'm having a similar issue with a font that I've trained for numbers and a few symbols only - I've attached a sample of the numbers. It is detecting 2's as 8's in my case.

I tried using a Gaussian blur and it appears to help. The results also seem to change depending on how much or how little blur is applied. Do you know why this is?

Do you know if it would help to blur the images when training Tesseract too?

Thanks!
Andy
Attachments (1)
txt.png (19 KB)