
I've done extensive research and cannot find a combination of techniques that will achieve what I need.

I have a situation where I need to perform OCR on hundreds of W2s to extract the data for a reconciliation. The W2s are very poor quality, as they are printed and subsequently scanned back into the computer. The aforementioned process is outside of my control; unfortunately I have to work with what I've got.

I was able to successfully perform this process last year, but I had to brute force it as timeliness was a major concern. I did so by manually indicating the coordinates to extract the data from, then performing the OCR only on those segments one at a time. This year, I would like to come up with a more dynamic situation in the anticipation that the coordinates could change, format could change, etc.
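For context, last year's brute-force pass amounted to something like the sketch below: hard-coded crop rectangles, then OCR on each crop one segment at a time. The coordinates and the `pytesseract` call are purely illustrative, not my real values:

```python
import numpy as np

# Hypothetical field coordinates as (x, y, w, h); the real values came
# from manually inspecting one scan and do not generalize.
BOXES = {
    "wages": (10, 10, 60, 25),
    "federal_tax": (80, 10, 60, 25),
}

def crop_fields(img, boxes):
    """Cut each named region out of the scanned form image."""
    return {name: img[y:y + h, x:x + w] for name, (x, y, w, h) in boxes.items()}

# Each crop would then go to the OCR engine one segment at a time, e.g.
#   text = pytesseract.image_to_string(crop)  # needs pytesseract + Tesseract
```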

I have included a sample, scrubbed W2 below. The idea is for each box on the W2 to be its own rectangle, and extract the data by iterating through all of the rectangles. I have tried several edge detection techniques but none have delivered exactly what is needed. I believe that I have not found the correct combination of pre-processing required. I have tried to mirror some of the Sudoku puzzle detection scripts.

Example W2

Here is the result of what I have tried thus far, along with the Python code, which works with either OpenCV 2 or 3:

Processed W2

import cv2
import numpy as np

img = cv2.imread(image_path_here)

# Integer division: cv2.resize expects integer dimensions.
newx, newy = img.shape[1] // 2, img.shape[0] // 2
img = cv2.resize(img, (newx, newy))
blur = cv2.GaussianBlur(img, (3, 3), 5)
ret, thresh1 = cv2.threshold(blur, 225, 255, cv2.THRESH_BINARY)

gray = cv2.cvtColor(thresh1, cv2.COLOR_BGR2GRAY)

edges = cv2.Canny(gray, 50, 220, apertureSize=3)

minLineLength = 20
maxLineGap = 50
# Keyword arguments matter here: positionally, the fifth parameter of
# HoughLinesP is the optional `lines` output, not minLineLength.
lines = cv2.HoughLinesP(edges, 1, np.pi/180, 100,
                        minLineLength=minLineLength, maxLineGap=maxLineGap)

# In OpenCV 3 `lines` has shape (N, 1, 4), so iterate over every row,
# not just lines[0].
for line in lines:
    x1, y1, x2, y2 = line[0]
    cv2.line(img, (x1, y1), (x2, y2), (255, 0, 255), 2)

cv2.imshow('hough', img)
cv2.waitKey(0)
ebeneditos
keyoung1
  • The problem is that with these parameters vertical lines are hardly detected. Try to find vertical lines with `lines_v = cv2.HoughLinesP(edges,1,np.pi,100,minLineLength,maxLineGap)` and make another loop for those. Also try different parameter values for the `HoughLinesP` function, perhaps setting different ones for horizontal and vertical lines. – ebeneditos Dec 15 '16 at 08:47

2 Answers


He he, edge detection is not the only way. As the edges are thick enough (at least one pixel everywhere), binarization allows you to isolate the regions inside the boxes.

By simple criteria you can get rid of clutter, and just bounding boxes give you a fairly good segmentation.

Bounding-box segmentation result

Yves Daoust

Let me know if you don't follow anything in my code. The biggest weaknesses of this approach are:

1: Noisy breaks in the main box outline would split it into separate blobs.

2: I don't know whether handwritten text is a factor here, but letters overlapping the edges of boxes could be a problem.

3: It does no orientation checking (you may actually want to add this; I don't think it would be too hard, and it would give you more accurate handles). It depends on your boxes being approximately aligned to the x-y axes; if they are sufficiently skewed, it will give you gross offsets to all your box corners (though it should still find them all).

I fiddled with the threshold set point a bit to separate all the text from the edges; you could probably pull it even lower if necessary, before the main line starts breaking. Also, if you are worried about line breaks, you could merge sufficiently large blobs into the final image.

Processing steps

Final result

Basically, the first step is fiddling with the threshold to find the most stable cutoff value (likely the lowest value that still keeps the box connected) for separating text and noise from the box.

Second, find the biggest positive blob (which should be the box grid). If your box doesn't stay all together, you may want to take a few of the largest blobs... though that will get sticky, so try to tune the threshold so that you get it as a single blob.

The last step is to get the rectangles. To do this, I just look for negative blobs (ignoring the first outer area).

And here is the code (sorry that it is in C++, but hopefully you understand the concept and would write it yourself anyhow):

#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"
#include <iostream>
#include <stdio.h>
#include <opencv2/opencv.hpp>

using namespace cv;


//Attempts to find the largest connected group of points (assumed to be the interconnected boundaries of the textbox grid)
Mat biggestComponent(Mat targetImage, int connectivity=8)
{
    Mat inputImage;
    inputImage = targetImage.clone();
    Mat finalImage;// = inputImage;
    int greatestBlobSize=0;
    std::cout<<"Top"<<std::endl;
    std::cout<<inputImage.rows<<std::endl;
    std::cout<<inputImage.cols<<std::endl;

    for(int i=0;i<inputImage.cols;i++)
    {
        for(int ii=0;ii<inputImage.rows;ii++)
        {
            if(inputImage.at<uchar>(ii,i)!=0)
            {
                Mat lastImage;
                lastImage = inputImage.clone();
                Rect boundbox; // must be a real object, not an uninitialized pointer
                int blobSize = floodFill(inputImage, cv::Point(i,ii), Scalar(0),&boundbox,Scalar(200),Scalar(255),connectivity);

                if(greatestBlobSize<blobSize)
                {
                    greatestBlobSize=blobSize;
                    std::cout<<blobSize<<std::endl;
                    Mat tempDif = lastImage-inputImage;
                    finalImage = tempDif.clone();
                }
                //std::cout<<"Loop"<<std::endl;
            }
        }
    }
    return finalImage;
}

//Takes an image that only has outlines of boxes and gets handles for each textbox.
//Returns a vector of points which represent the top left corners of the text boxes.
std::vector<Rect> boxCorners(Mat processedImage, int connectivity=4)
{
    std::vector<Rect> boxHandles;

    Mat inputImage;
    bool outerRegionFlag=true;

    inputImage = processedImage.clone();

    std::cout<<inputImage.rows<<std::endl;
    std::cout<<inputImage.cols<<std::endl;

    for(int i=0;i<inputImage.cols;i++)
    {
        for(int ii=0;ii<inputImage.rows;ii++)
        {
            if(inputImage.at<uchar>(ii,i)==0)
            {
                Mat lastImage;
                lastImage = inputImage.clone();
                Rect boundBox;

                if(outerRegionFlag) //This is to floodfill the outer zone of the page
                {
                    outerRegionFlag=false;
                    floodFill(inputImage, cv::Point(i,ii), Scalar(255),&boundBox,Scalar(0),Scalar(50),connectivity);
                }
                else
                {
                    floodFill(inputImage, cv::Point(i,ii), Scalar(255),&boundBox,Scalar(0),Scalar(50),connectivity);
                    boxHandles.push_back(boundBox);
                }
            }
        }
    }
    return boxHandles;
}

Mat drawTestBoxes(Mat originalImage, std::vector<Rect> boxes)
{
    Mat outImage;
    outImage = originalImage.clone();
    outImage = outImage*0; //really I am just being lazy, this should just be initialized with dimensions

    for(int i=0;i<boxes.size();i++)
    {
        rectangle(outImage,boxes[i],Scalar(255));
    }
    return outImage;
}

int main() {

    Mat image;
    Mat thresholded;
    Mat processed;

    image = imread( "Images/W2.png", 1 );
    Mat channel[3];

    split(image, channel);


    threshold(channel[0],thresholded,150,255,1);

    std::cout<<"Computing biggest object"<<std::endl;
    processed = biggestComponent(thresholded);

    std::vector<Rect> textBoxes = boxCorners(processed);

    Mat finalBoxes = drawTestBoxes(image,textBoxes);


    namedWindow("Original", WINDOW_AUTOSIZE );
    imshow("Original", channel[0]);

    namedWindow("Thresholded", WINDOW_AUTOSIZE );
    imshow("Thresholded", thresholded);

    namedWindow("Processed", WINDOW_AUTOSIZE );
    imshow("Processed", processed);

    namedWindow("Boxes", WINDOW_AUTOSIZE );
    imshow("Boxes", finalBoxes);



    std::cout<<"waiting for user input"<<std::endl;

    waitKey(0);

    return 0;
}
Sneaky Polar Bear