I am using pytesser to OCR a small image and get a string from it:

image = Image.open(ImagePath)
text = image_to_string(image)
print text

However, pytesser sometimes recognizes and returns non-ASCII characters. The problem occurs when I try to print the recognized text: in Python 2.7 (which is what I am using), the program crashes.

Is there some way to make it so pytesser does not return any non-ascii characters? Perhaps there is something you can change in tesseract OCR?

Or, is there some way to test a string for non-ascii characters (without crashing the program) and then just not print that line?

Some would suggest using Python 3.4, but from my research it looks like pytesser does not work with it; see "Pytesser in Python 3.4: name 'image_to_string' is not defined?"

Micro

2 Answers

I would go with Unidecode. This library converts non-ASCII characters to their most similar ASCII representation.

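from unidecode import unidecode  # import the function, not just the module

image = Image.open(ImagePath)
text = image_to_string(image)
# In Python 2, unidecode expects a unicode object, so decode the byte string first
# (assumption: the OCR output is UTF-8 encoded)
print unidecode(text.decode('utf-8'))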

It should work perfectly!
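
For the other option raised in the question, testing a string for non-ASCII characters and skipping it instead of transliterating, here is a minimal sketch. The helper names `is_ascii` and `ascii_lines` are made up for illustration and are not part of pytesser or Unidecode. The code assumes the OCR result is either a byte string or a unicode string, and it is written to behave the same on Python 2.7 and Python 3.

```python
def is_ascii(data):
    """Return True if data (byte string or unicode text) is pure ASCII."""
    if isinstance(data, bytes):
        # Byte strings (plain str in Python 2): try to decode as ASCII.
        try:
            data.decode("ascii")
        except UnicodeDecodeError:
            return False
        return True
    # Unicode text: try to encode as ASCII.
    try:
        data.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


def ascii_lines(text):
    """Keep only the lines of text that contain no non-ASCII characters."""
    return [line for line in text.splitlines() if is_ascii(line)]
```

With these helpers, looping over `ascii_lines(text)` and printing each line would skip any line that contains a non-ASCII character, without raising an exception along the way.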

Fabio Menegazzo
  • Alternatively, if user wants to remove the unicode, they can follow this post: http://stackoverflow.com/questions/15321138/removing-unicode-u2026-like-characters-in-a-string-in-python2-7 – wbest Jul 24 '14 at 17:01
  • was giving a TypeError: 'module' object is not callable. made a small change. `from unidecode import unidecode` – Sreeragh A R Sep 12 '17 at 06:18

Is there some way to make it so pytesser does not return any non-ascii characters?

You could limit the characters recognizable by tesseract by using the option tessedit_char_whitelist.

For instance:

import string

# Allow only digits and ASCII letters, plus "_", "-" and "."
char_whitelist = string.digits
char_whitelist += string.ascii_lowercase
char_whitelist += string.ascii_uppercase
image = Image.open(ImagePath)
text = image_to_string(image,
    config="-c tessedit_char_whitelist=%s_-." % char_whitelist)
print text

See also: https://github.com/tesseract-ocr/tesseract/wiki/FAQ-Old#how-do-i-recognize-only-digits
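
If the whitelist is needed in more than one place, building the config string can be pulled into a small helper. This is only a sketch: `build_whitelist_config` is a hypothetical name, and it assumes your OCR wrapper's `image_to_string` accepts a `config` keyword as in the example above, which is worth checking against the version you have installed.

```python
import string


def build_whitelist_config(extra_chars="_-."):
    """Build a tesseract config string that whitelists ASCII characters only."""
    # Digits and ASCII letters, plus a few extra punctuation marks.
    allowed = string.digits + string.ascii_letters + extra_chars
    return "-c tessedit_char_whitelist=%s" % allowed


config = build_whitelist_config()
```

The resulting `config` string would then be passed as `config=config` in the `image_to_string` call.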

Giovanni Cappellotto