I want to find the location of every word on a PDF page. I have searched the web but couldn't find anything. Can anyone tell me which library (preferably for the Java platform) I should use?

Prabhjot Rai
  • This type of question usually gets flagged. Until then, look at the PrintTextLocations example in PDFBox. In the 2.0 sources, there's also the DrawPrintTextLocations example, which is the same on steroids. – Tilman Hausherr Dec 08 '15 at 11:30
  • I am looking to read the PDF line by line. Can you help me by providing a link to a book/doc if you know one? My idea was to use the character positions along the x-axis to read the characters line by line. – Prabhjot Rai Dec 08 '15 at 12:31
  • That's a different question. To read line by line, just use the PDFTextStripper class. https://pdfbox.apache.org/1.8/cookbook/textextraction.html – Tilman Hausherr Dec 08 '15 at 12:36

2 Answers


Take a look at this tutorial: http://www.luigimicco.altervista.org/doku.php/en/doc/pdf_structure

Basically, with PDFBox, you can access a page's content stream with

InputStream is = yourPDFDocument.getDocumentCatalog().getPages().get(yourPage).getContents();

and then search it for the `x y Td` line you're looking for.
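As a rough illustration of that search, here is a minimal sketch that scans content-stream text for `x y Td` operand pairs with a regular expression. It assumes the stream has already been read into a `String`; the class name `TdScanner`, the method `scan`, and the sample stream are made up for this example and are not part of PDFBox. The numbers it returns are the raw `Td` operands, which are offsets in text space and not necessarily page coordinates, and a pattern match can also hit text inside string literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TdScanner {
    /** Raw operands of one Td operator (hypothetical helper type, not a PDFBox class). */
    public static final class TdOperands {
        public final double tx;
        public final double ty;

        TdOperands(double tx, double ty) {
            this.tx = tx;
            this.ty = ty;
        }
    }

    // A PDF number: optional sign, then digits with an optional fraction, or a leading dot.
    private static final String NUM = "[-+]?(?:\\d+\\.?\\d*|\\.\\d+)";

    // Two numbers followed by the Td operator; the lookbehind avoids starting mid-token.
    private static final Pattern TD =
            Pattern.compile("(?<![\\w.+-])(" + NUM + ")\\s+(" + NUM + ")\\s+Td\\b");

    /** Returns the operand pairs of every "x y Td" sequence found in the given text. */
    public static List<TdOperands> scan(String content) {
        List<TdOperands> result = new ArrayList<>();
        Matcher m = TD.matcher(content);
        while (m.find()) {
            result.add(new TdOperands(
                    Double.parseDouble(m.group(1)),
                    Double.parseDouble(m.group(2))));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up content stream fragment, for illustration only.
        String sample = "BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14.5 Td (World) Tj ET";
        for (TdOperands op : scan(sample)) {
            System.out.println(op.tx + " " + op.ty);
        }
    }
}
```

The second pair in the sample shows why the raw operands are not positions: `0 -14.5 Td` moves relative to the start of the current line, so it would have to be accumulated and combined with the current matrices to get anything like a page coordinate.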

I'm REALLY sure there is a simpler way to do it, but since I work a lot with the content stream for a project, this is the only way I'm aware of.
Search PDFBox's Javadocs for more details!

I hope this will help you :)

Cook
  • **(A)** The **Td** operation is not the only way to position text. **(B)** The parameters of the **Td** operation may be subject to transformations and, therefore, not contain the coordinates the OP searches. **(C)** You only look at the page content stream and completely ignore content streams of form xobjects. **(D)** For extracting text with locations with PdfBox one should derive a solution from the `PDFTextStripper` class. – mkl Dec 09 '15 at 16:52
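To illustrate point (D): a class derived from `PDFTextStripper` can override `writeString(String, List<TextPosition>)`, which receives the positioned glyphs for each chunk of text. The sketch below shows only the step that would follow, grouping glyphs into words with bounding boxes. So that it runs without PDFBox on the classpath, it uses a hypothetical `Glyph` class as a stand-in for `TextPosition`; `WordBoxes`, `Glyph`, `Word`, and `groupIntoWords` are names invented for this example. It assumes the glyphs arrive in reading order and that words are separated by whitespace glyphs, which real PDFs do not always guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordBoxes {
    /** Hypothetical stand-in for one positioned glyph (PDFBox would supply a TextPosition). */
    public static final class Glyph {
        final String text;
        final double x, y, width, height;

        public Glyph(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** A word together with the bounding box that encloses all of its glyphs. */
    public static final class Word {
        public final String text;
        public final double x, y, width, height;

        Word(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Splits glyphs into words at whitespace glyphs, merging the glyph boxes of each word. */
    public static List<Word> groupIntoWords(List<Glyph> glyphs) {
        List<Word> words = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        double minX = 0, minY = 0, maxX = 0, maxY = 0;
        for (Glyph g : glyphs) {
            if (g.text.trim().isEmpty()) {
                // A whitespace glyph closes the word collected so far, if any.
                if (text.length() > 0) {
                    words.add(new Word(text.toString(), minX, minY, maxX - minX, maxY - minY));
                    text.setLength(0);
                }
                continue;
            }
            if (text.length() == 0) {
                // First glyph of a new word starts the bounding box.
                minX = g.x;
                minY = g.y;
                maxX = g.x + g.width;
                maxY = g.y + g.height;
            } else {
                // Later glyphs grow the box as needed.
                minX = Math.min(minX, g.x);
                minY = Math.min(minY, g.y);
                maxX = Math.max(maxX, g.x + g.width);
                maxY = Math.max(maxY, g.y + g.height);
            }
            text.append(g.text);
        }
        if (text.length() > 0) {
            words.add(new Word(text.toString(), minX, minY, maxX - minX, maxY - minY));
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up glyph positions, for illustration only.
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("H", 10, 100, 5, 10),
                new Glyph("i", 15, 100, 2, 10),
                new Glyph(" ", 17, 100, 3, 10),
                new Glyph("y", 20, 100, 5, 10),
                new Glyph("o", 25, 100, 5, 10));
        for (Word w : groupIntoWords(glyphs)) {
            System.out.println(w.text + " @ (" + w.x + ", " + w.y + ") " + w.width + " x " + w.height);
        }
    }
}
```

In an actual PDFBox subclass, the `Glyph` values would presumably be built from `TextPosition` getters such as `getXDirAdj()`, `getYDirAdj()`, and `getWidthDirAdj()`; that wiring is left out here.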

You can use Textricator. Unfortunately its documentation is not maintained, so it's very difficult to get its more interesting features to work. However, to just see the text locations, you can use the simple text mode.

./textricator.bat text --pages=2 xxx.pdf

# The output is a long CSV listing of properties for the document, including the OCR-read text and its x,y coordinates.
not2qubit