
I have a bunch of PDF files. Some are regular PDFs that are searchable, and some are scanned versions of documents that are not. I would like to extract the content of each PDF. To extract content from the regular PDFs I use Apache Tika, and to extract content from the non-searchable ones I use tesseract-ocr. However, I need to distinguish which PDFs are normal and which are not. Is there any way to do that?

– HHH

Try a PDF text extractor (like Tika) first. Most likely it returns no or very little text. In that case, switch to OCR. – mkl Jul 08 '15 at 22:03
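A minimal sketch of that suggestion, using Tika's AutoDetectParser; the 100-character threshold and the runTesseractOcr helper are placeholders, not anything the comment prescribes:

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public static String extractWithOcrFallback(String filePath) throws Exception {
    BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
    try (InputStream stream = new FileInputStream(filePath)) {
        new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
    }
    String text = handler.toString().trim();
    // Placeholder threshold: a scanned PDF usually yields almost no text at all.
    if (text.length() < 100) {
        return runTesseractOcr(filePath); // hypothetical helper wrapping tesseract-ocr
    }
    return text;
}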

2 Answers


This will help you. It uses PDFBox to pull the text from a sample of pages and treats the PDF as searchable only if every sampled page yields at least a few words:

import java.io.File;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public static boolean isSearchablePdf(String filePath) throws IOException {

    boolean isSearchable = true;

    // PDDocument.load is the PDFBox 2.x entry point; it replaces the older
    // PDFParser/COSDocument dance, and try-with-resources closes it for us.
    try (PDDocument document = PDDocument.load(new File(filePath))) {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        int noOfPages = document.getNumberOfPages();

        // Check the first five pages (or all of them, if there are fewer).
        for (int page = 1; page <= noOfPages && page <= 5; page++) {
            pdfStripper.setStartPage(page);
            pdfStripper.setEndPage(page);
            isSearchable = isSearchablePDFContent(pdfStripper.getText(document));
            if (!isSearchable) {
                return false;
            }
        }

        // For longer documents, also sample four random pages after page 5...
        if (noOfPages > 10) {
            int min = 5;
            int max = noOfPages;
            for (int i = 0; i < 4; i++) {
                int randomPage = min + (int) (Math.random() * ((max - min) + 1));
                pdfStripper.setStartPage(randomPage);
                pdfStripper.setEndPage(randomPage);
                isSearchable = isSearchablePDFContent(pdfStripper.getText(document));
                if (!isSearchable) {
                    return false;
                }
            }
        }

        // ...and the last few pages.
        if (noOfPages >= 10) {
            for (int page = noOfPages - 5; page <= noOfPages; page++) {
                pdfStripper.setStartPage(page);
                pdfStripper.setEndPage(page);
                isSearchable = isSearchablePDFContent(pdfStripper.getText(document));
                if (!isSearchable) {
                    return false;
                }
            }
        }
    }

    return isSearchable;
}

// A page counts as searchable if it yields more than a few words;
// a scanned page typically produces no text at all.
public static boolean isSearchablePDFContent(String contentOfPage) {
    int count = 0;
    StringTokenizer st = new StringTokenizer(contentOfPage);
    while (st.hasMoreTokens()) {
        st.nextToken();
        if (count >= 3) {
            return true;
        }
        count++;
    }
    return false;
}
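For example, you can use it to route each file to the right extractor (the path here is a placeholder):

if (isSearchablePdf("/data/input/sample.pdf")) {
    // searchable: extract the text with Tika
} else {
    // scanned: hand the file (or its page images) to tesseract-ocr
}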
– sagar vyas

Can you not use Tika to extract both text and images, and, once you see there is very little text, simply feed the images to Tesseract? According to this answer, Extract Images from PDF with Apache Tika, image extraction should be possible, at least in theory; a rough sketch follows below.
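A rough sketch of that idea, assuming Tika 1.x with the tika-parsers PDF module on the classpath. setExtractInlineImages is off by default (and can be slow on image-heavy files), the output file naming is made up for illustration, and real code should pick the extension from the image's content type:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public static void extractPdfImages(File pdf, File outDir) throws Exception {
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true); // scanned pages are usually inline images

    ParseContext context = new ParseContext();
    context.set(PDFParserConfig.class, pdfConfig);
    context.set(EmbeddedDocumentExtractor.class, new EmbeddedDocumentExtractor() {
        private int count = 0;

        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler,
                Metadata metadata, boolean outputHtml) throws IOException {
            // Dump each embedded image to disk; these files then go to tesseract.
            // The ".png" suffix is illustrative only.
            File out = new File(outDir, "image-" + (count++) + ".png");
            try (FileOutputStream fos = new FileOutputStream(out)) {
                stream.transferTo(fos); // Java 9+
            }
        }
    });

    try (InputStream stream = new FileInputStream(pdf)) {
        new AutoDetectParser().parse(stream, new BodyContentHandler(-1), new Metadata(), context);
    }
}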

If Tika does not work out, you should be able to take the same approach with PDFBox; a sketch of the image extraction side follows below. See How to read PDF files using Java? for the general text extraction part and extract images from pdf using pdfbox for hints on the image extraction.
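A sketch of the PDFBox side, assuming PDFBox 2.x. It walks each page's XObject resources, which covers the typical full-page scan images but not true inline images; the output naming is again a placeholder:

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public static void extractImagesWithPdfBox(File pdf, File outDir) throws IOException {
    try (PDDocument document = PDDocument.load(pdf)) {
        int count = 0;
        for (PDPage page : document.getPages()) {
            PDResources resources = page.getResources();
            for (COSName name : resources.getXObjectNames()) {
                PDXObject xObject = resources.getXObject(name);
                if (xObject instanceof PDImageXObject) {
                    // Write each image to disk, ready to be fed to tesseract.
                    BufferedImage image = ((PDImageXObject) xObject).getImage();
                    ImageIO.write(image, "png", new File(outDir, "page-" + (count++) + ".png"));
                }
            }
        }
    }
}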

– ikkjo