Tesseract OCR examples: Preprocessing

Hello,

I'm trying to recognize the machine-readable part of a passport (see the last line in this picture: http://s.hswstatic.com/gif/passport-11.jpg).

I'm using Tesseract on Android (tess-two) and take the picture with a 5-megapixel mobile camera. Unfortunately, the accuracy is not high enough. To improve recognition, I have tried cropping the picture and retraining Tesseract for the font used in passports (OCR-B). Both raise accuracy, but still not to an acceptable level.
Here is a typical cropped picture I hand to Tesseract to perform OCR:

The binarized picture created by Tess for the actual recognition looks like this:

This is what Tesseract recognizes:

09 1 M 1 907 1 8 F8 F857<4 < W<B<O <UME
QVWBBENO W JMGHJ <RBP6W9BQR ED

I figured out that the thin line at the bottom is extremely distracting to Tesseract. If I cut off the line manually and then perform OCR, the results are perfectly fine and all characters are recognized.
My question is: how can I find and get rid of that line automatically whenever it appears in the cropped picture? This has to be done on an Android phone.

Any help will be appreciated!
Mirko
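
One possible way to approach the question above is a horizontal projection profile: count the dark pixels in every row of the binarized image and treat any row that is dark across almost its whole width as part of the rule line, since rows that pass through text always contain gaps between characters. The sketch below is only an illustration under assumptions that are not in the original post: the image is already binarized into a boolean[][] array (true = dark pixel), the line is roughly horizontal, and the class name RuleLineRemover, its methods, and the 0.8 fill threshold are hypothetical choices, not part of tess-two.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of tess-two: finds and clears near-full-width
// dark rows in a binarized image (true = dark pixel).
final class RuleLineRemover {
    private RuleLineRemover() {}

    // Returns the indices of rows whose share of dark pixels is at least minFill.
    // minFill is an assumed tuning parameter, e.g. 0.8 for "80% of the row is dark".
    static List<Integer> findLineRows(boolean[][] dark, double minFill) {
        List<Integer> rows = new ArrayList<>();
        for (int y = 0; y < dark.length; y++) {
            int width = dark[y].length;
            if (width == 0) {
                continue; // nothing to measure in an empty row
            }
            int count = 0;
            for (boolean px : dark[y]) {
                if (px) {
                    count++;
                }
            }
            if (count >= minFill * width) {
                rows.add(y);
            }
        }
        return rows;
    }

    // Clears (sets to background) every row reported by findLineRows
    // and returns how many rows were cleared.
    static int removeLines(boolean[][] dark, double minFill) {
        List<Integer> rows = findLineRows(dark, minFill);
        for (int y : rows) {
            Arrays.fill(dark[y], false);
        }
        return rows.size();
    }
}
```

On Android, the boolean array would have to be filled from the cropped bitmap first (for example by thresholding the pixel luminance), and the cleaned pixels written back before the image is passed to Tesseract. If the photo is noticeably skewed, a single row will not cover the whole line, so the threshold would need to be lowered or the image deskewed beforehand.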


Thursday, 9 April 2015

Preprocessing - detailed cropping
