I'm trying to solve some simple captchas using OpenCV and pytesseract. Some of the captcha samples are:

[Images: four sample captchas]

I tried to remove the noisy dots with some filters:

import cv2
import numpy as np
import pytesseract

img = cv2.imread(image_path)
_, img = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)
img = cv2.morphologyEx(img, cv2.MORPH_OPEN, np.ones((4, 4), np.uint8), iterations=1)
img = cv2.medianBlur(img, 3)
img = cv2.medianBlur(img, 3)
img = cv2.medianBlur(img, 3)
img = cv2.medianBlur(img, 3)
img = cv2.GaussianBlur(img, (5, 5), 0)
cv2.imwrite('res.png', img)
print(pytesseract.image_to_string('res.png'))

The resulting transformed images are:

[Images: the four captchas after filtering]

Unfortunately, pytesseract recognizes only the first captcha correctly. Is there a better transformation I could apply?

Final Update:

As @Neil suggested, I tried to remove noise by detecting connected pixels. To find them, I used OpenCV's connectedComponentsWithStats function, which detects connected pixels and assigns each group (component) a label. By finding the connected components and removing the ones with a small number of pixels, I got better overall detection accuracy with pytesseract.

And here are the new resulting images:

[Images: the four captchas after removing small connected components]

Mehran Torki

4 Answers

I've taken a much more direct approach to filtering ink splotches from PDF documents. I won't share the whole thing because it's a lot of code, but here is the general strategy I adopted:

  1. Use Python Pillow library to get an image object where you can manipulate pixels directly.
  2. Binarize the image.
  3. Find all connected pixels and how many pixels are in each group of connected pixels. You can do this using the minesweeper algorithm (a flood fill), which is easy to search for.
  4. Set some threshold value of pixels that all legitimate letters are expected to have. This will be dependent on your image resolution.
  5. Replace all black pixels in groups below the threshold with white pixels.
  6. Convert back to image.
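
Steps 3 to 5 could be sketched roughly as below. This is not the code from my project, just a minimal pure-Python illustration. It assumes the binarized image has already been turned into a list of rows with 1 for black and 0 for white, and the function name remove_small_groups is made up for this example. Loading, binarizing and saving with Pillow (steps 1, 2 and 6) are left out.

```python
from collections import deque


def remove_small_groups(grid, min_pixels):
    """Return a copy of grid with small black groups turned white.

    grid is a list of rows; 1 = black (ink), 0 = white.
    Groups use 8-connectivity (diagonal neighbours count as connected).
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    result = [row[:] for row in grid]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if grid[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill that collects one connected group.
            group = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                group.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Step 5: groups below the threshold are painted white.
            if len(group) < min_pixels:
                for gy, gx in group:
                    result[gy][gx] = 0
    return result
```

A queue is used instead of recursion so that large groups of pixels cannot hit Python's recursion limit.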
Neil

Your final output image is too blurry. To enhance the performance of pytesseract you need to sharpen it.

Sharpening is not as easy as blurring, but there exist a few code snippets / tutorials (e.g. http://datahacker.rs/004-how-to-smooth-and-sharpen-an-image-in-opencv/).

Rather than chaining blurs, blur once with either a Gaussian or a median blur and experiment with the parameters until you get the amount of blur you need. You could try one method after the other, but there is no reason to chain several blurs of the same method.

Leon

There is an OCR example in Python that detects the characters. Save several images, apply the filter, and train an SVM on them; that may help you. I trained a model with only a few images and the results were still acceptable. Check this link. Good luck!

I know the post is a bit old, but I suggest trying this library that I developed some time ago. If you have a set of labelled captchas, it should fit your needs. Take a look: https://github.com/punkerpunker/captcha_solver

The README has a section called "Train model on external data" that you might be interested in.

Gleb V