
Is it possible to get the locations of words using PDFBox, similar to "processTextPosition"? It seems that processTextPosition is called on single characters only, and the code that merges them into words is part of PDFTextStripper (in the "normalize" method), which does not return the location of the text. Is there a method or utility that extracts the location as well? (For those wondering about the motivation: the information is actually a table, and we would like to detect empty cells.) Thanks.

user964797
  • maybe this will help: http://stackoverflow.com/questions/3203790/parsing-pdf-files-especially-with-tables-with-pdfbox/12545981#12545981 – impeto Oct 06 '12 at 02:54
  • Thanks for the suggestion. Eventually our solution was to change writePage, to keep the words with their position (as described in the URL you sent). However, in our case, the number of columns (and their positions) is not known, and we need to find it based on the organization of the information (e.g. - if there are a lot of lines that have words that start at position Y=100, probably there is a table column there). Is there a component that can detect this structure? If so - can it handle slightly rotated pages as well, when the "Y" is not a constant? – user964797 Oct 08 '12 at 10:59
  • One possible way is to keep track of the characters by overriding processTextPosition() of the PDFTextStripper class and checking for the word separator. Keep a mark on the word start and a mark on the word end, and save the word when a delimiter is encountered. – programer8 Dec 05 '13 at 00:21
  • @user964797 can you add your answer as an official Answer instead of just a comment? – Cel Skeggs Mar 29 '15 at 07:27
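
The character-tracking idea from programer8's comment can be sketched without PDFBox. This is only an illustration: Glyph, Word and WordCollector are hypothetical names, Glyph stands in for the per-character data that processTextPosition() receives, and the sketch assumes that spaces arrive as explicit characters, which many PDFs do not guarantee.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the per-character data (text, x, y) that a
// TextPosition would carry; not a PDFBox class.
final class Glyph {
    final String text;
    final float x;
    final float y;

    Glyph(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

// A finished word with the x/y of its first character and the x of its last one.
final class Word {
    final String text;
    final float startX;
    final float startY;
    final float endX;

    Word(String text, float startX, float startY, float endX) {
        this.text = text;
        this.startX = startX;
        this.startY = startY;
        this.endX = endX;
    }
}

// Collects characters into words, splitting on whitespace characters.
final class WordCollector {
    private final List<Word> words = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private float startX;
    private float startY;
    private float endX;

    /** Feed one character at a time, in reading order. */
    void accept(Glyph g) {
        if (g.text.trim().isEmpty()) {
            flush(); // whitespace acts as the word delimiter
            return;
        }
        if (current.length() == 0) {
            startX = g.x; // mark the word start
            startY = g.y;
        }
        endX = g.x; // keep moving the word-end mark
        current.append(g.text);
    }

    /** Emits the pending word, if any; call once more after the last character. */
    void flush() {
        if (current.length() > 0) {
            words.add(new Word(current.toString(), startX, startY, endX));
            current.setLength(0);
        }
    }

    List<Word> words() {
        return words;
    }
}
```

In a real stripper, accept() would be called from an overridden processTextPosition() with values read from the TextPosition, and flush() at the end of each line or page.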

1 Answer


To get words and their x and y positions in text extracted from a PDF file, extend the PDFTextStripper class and use the custom class to extract the text, for example:

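import java.io.IOException;
import java.util.List;

// Packages below are for PDFBox 1.8.x; in 2.x these classes live in org.apache.pdfbox.text.
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class CustomPDFTextStripper extends PDFTextStripper {

    public CustomPDFTextStripper() throws IOException {
        super();
    }

    /**
     * Override the default functionality of PDFTextStripper:
     * prefix each piece of text with the position of its first character.
     */
    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        if (textPositions.isEmpty()) {
            writeString(text); // no position data: fall back to the plain text
            return;
        }
        TextPosition firstPosition = textPositions.get(0);
        // getTextPos() returns the text matrix of the first character.
        writeString(String.format("[%s , %s , %s]", firstPosition.getTextPos().getXPosition(),
                firstPosition.getTextPos().getYPosition(), text));
    }
}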

Create an instance of this custom class and extract the text as follows:

PDDocument document = PDDocument.load(new File("input.pdf")); // "input.pdf" is a placeholder path
PDFTextStripper pdfStripper = new CustomPDFTextStripper();
String text = pdfStripper.getText(document);
document.close();

The resulting text string consists of entries of the form [xposition , yposition , word], separated by the default word separator.
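
Once the x positions of the words are available, the column-guessing heuristic mentioned in the question's comments (many words starting at about the same position suggest a table column) could be sketched as below. ColumnGuesser, tolerance and minCount are hypothetical names and parameters, not part of PDFBox, and the sketch assumes an unrotated page so that a column start shows up as a roughly constant x value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of PDFBox: guesses table column starts
// from the x positions at which words begin.
final class ColumnGuesser {

    /**
     * @param wordStartXs x position of the first character of each word
     * @param tolerance   largest distance from a cluster's smallest x that still counts as the same column
     * @param minCount    minimum number of words needed to accept a cluster as a column
     * @return the average x of each accepted cluster, in ascending order
     */
    static List<Double> guessColumnStarts(List<Double> wordStartXs, double tolerance, int minCount) {
        List<Double> sorted = new ArrayList<>(wordStartXs);
        Collections.sort(sorted);

        List<Double> columns = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            // Grow a cluster while the next x stays within tolerance of the cluster's first x.
            double first = sorted.get(i);
            double sum = 0;
            int j = i;
            while (j < sorted.size() && sorted.get(j) - first <= tolerance) {
                sum += sorted.get(j);
                j++;
            }
            int count = j - i;
            if (count >= minCount) {
                columns.add(sum / count); // enough words start here: treat it as a column
            }
            i = j;
        }
        return columns;
    }
}
```

A row with no word starting near one of the guessed column positions would then be a candidate for an empty cell. Suitable values for tolerance and minCount depend on the document and would need tuning.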

Ovokerie Ogbeta