PDFBox retrieve text from overlapping boxes

Question

I've had some success using the PDFTextStripperByArea class to retrieve text contained within a specified rectangle. However, some of the PDFs I am scraping have text that sits in slightly different places from page to page. I'm looking for help with how to deal with this.

In the example below, I can open the PDF in Acrobat Edit mode and see multiple text boxes (outlines with thin grey lines). I have indicated two regions (purple and red) that I would like to extract text from. However, instead of just getting the text physically inside the rectangle, I'd like all the text from the overlapping text boxes.

Is there a way to do this?
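
To make the request concrete, here is a minimal sketch of the selection rule being asked for: keep the full text of every box that overlaps the region, rather than clipping text to the region. It assumes the bounding rectangle and text of each box are already known. `OverlapSelector` and `TextBox` are hypothetical names for illustration, not PDFBox classes, and the sketch does not address how such boxes could be recovered from the PDF itself.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; nothing here is PDFBox API.
class OverlapSelector {

    // Assumed input: one text box with its bounding rectangle and full text.
    static final class TextBox {
        final Rectangle2D bounds;
        final String text;

        TextBox(Rectangle2D bounds, String text) {
            this.bounds = bounds;
            this.text = text;
        }
    }

    // Returns the full text of every box whose interior overlaps the region,
    // in input order. Boxes that only touch the region's edge are not included,
    // because Rectangle2D.intersects requires a non-empty overlap.
    static List<String> textOfOverlappingBoxes(List<TextBox> boxes, Rectangle2D region) {
        List<String> hits = new ArrayList<>();
        for (TextBox box : boxes) {
            if (box.bounds.intersects(region)) {
                hits.add(box.text);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up coordinates: two stacked boxes and one far away.
        List<TextBox> boxes = Arrays.asList(
                new TextBox(new Rectangle2D.Double(100, 100, 200, 20), "first box"),
                new TextBox(new Rectangle2D.Double(100, 130, 200, 20), "second box"),
                new TextBox(new Rectangle2D.Double(400, 400, 100, 20), "far box"));

        // The region only partly covers the first two boxes,
        // yet their complete text is selected.
        Rectangle2D region = new Rectangle2D.Double(250, 110, 100, 30);
        System.out.println(textOfOverlappingBoxes(boxes, region));
    }
}
```

With a rule like this, a region drawn roughly where the text is expected would still pick up boxes that shift slightly from page to page, as long as they overlap it at all.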

Please share an example PDF. It is not entirely clear what these "text boxes" are in PDF syntax. They might be all the text drawn as part of a single text object, or they might be all the text drawn inside a rectangle path, or something else entirely. — mkl, Oct 13 '17 at 04:29
@mkl The grey boxes are just what comes up in Acrobat when I use Edit mode. I can't see any concept that matches in PDFBox (I thought maybe beads or articles, but think not). Documents have sensitive data so can't share here. I'll see if I can find something else less sensitive with same type of content. — beldaz, Oct 14 '17 at 20:57
@mkl Please see https://gist.github.com/beldaz/8d658c7ae8d9cb9402ca61f4256c4319 where the text in the bottom right of the page is editable in Acrobat as 7 distinct text boxes. I generated this by replacing the text of an existing PDF, not from scratch. — beldaz, Oct 14 '17 at 21:30

0 Answers