
I have thousands of PDF files that I need to convert to txt files while preserving the original line breaks. An example explains it best. The files are in this format:

(example A)

1. Lorem ipsum dolor sit amet, consectetur adipiscing elit
2. Lorem ipsum dolor sit amet, consectetur adipiscing elit
3. Lorem ipsum dolor sit amet, consectetur adipiscing elit
4. Lorem ipsum dolor sit amet, consectetur adipiscing elit

The lines are very long, much longer than the ones above, so in the PDF they get wrapped like this:

(example B)

1. Lorem ipsum dolor sit amet, 
consectetur adipiscing elit
2. Lorem ipsum dolor sit amet, 
consectetur adipiscing elit
3. Lorem ipsum dolor sit amet, 
consectetur adipiscing elit
4. Lorem ipsum dolor sit amet, 
consectetur adipiscing elit

How do I get the text as in example A, without the line breaks introduced by wrapping? I have tried the PHP PDFParser library, Python PDFMiner, and XPDF pdftotext; none of them worked. They gave me either example B or a downright mess.

What makes me think this is possible is that the online service http://pdf2doc.com/ gives me example A, just the way I want it, and then I can simply save the doc as txt.

daxter1992
  • Can you share a sample file? If the file is tagged accordingly, the information required for your task might be present in those tags. Otherwise it might be educated guesswork by the Web service. – mkl May 31 '16 at 05:09
  • Here is a sample file https://www.dropbox.com/s/1a9xfk24vf93tk6/UU_NO_17_2012.PDF?dl=0 What kind of tags would I look for? – daxter1992 May 31 '16 at 13:43
  • I am going to agree with @mkl, the content in the PDF may or may not contain carriage returns and the tool you are using to read the content out of the PDF may or may not pay attention to those. You need something that will either look at the tag structure of the PDF (if the tag structure exists) or something that will create a tag structure algorithmically by looking at the content. If you have control over how the input files are created, this is a much easier problem to solve. – Brandon Haugen May 31 '16 at 14:03
  • Try Qoppa PDF Studio (desktop application where you can do a single document or a batch) or jPDFText (java library). – Amber May 31 '16 at 20:27
  • I looked into your sample PDF. It is not tagged, and the content stream contains no clear hints about where a new paragraph starts. Any application or service that combines the individual lines of text in the PDF into paragraphs does so by educated guesswork, i.e. by deducing from the space between lines, indentation of lines, punctuation marks, font, font size, etc. where paragraphs start and end. – mkl Jun 01 '16 at 06:29
  • Hey, did you find a solution yet? I am looking for the same thing, please advise. – Abhishek B May 21 '18 at 11:50
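
The guesswork described in the comments can be illustrated on already-extracted text. The following is a minimal sketch, not a general solution: it assumes every logical line starts with a numeric marker such as "1. " (as in example A) and treats any other line as a wrapped continuation. The function name `unwrap_numbered_lines` and the pattern `ITEM_START` are hypothetical names chosen for this illustration, and the heuristic will misfire if a continuation line happens to begin with a number followed by a period.

```python
import re

# Assumption: a logical line always begins with a marker like "12. ".
ITEM_START = re.compile(r"^\d+\.\s")


def unwrap_numbered_lines(text: str) -> str:
    """Join wrapped continuation lines onto the numbered line they belong to."""
    merged = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # drop blank lines left over from extraction
        if ITEM_START.match(line) or not merged:
            merged.append(line)  # a new numbered item starts here
        else:
            merged[-1] += " " + line  # continuation of the previous item
    return "\n".join(merged)


if __name__ == "__main__":
    example_b = (
        "1. Lorem ipsum dolor sit amet, \n"
        "consectetur adipiscing elit\n"
        "2. Lorem ipsum dolor sit amet, \n"
        "consectetur adipiscing elit\n"
    )
    print(unwrap_numbered_lines(example_b))
```

Run as a script, this turns the two wrapped items from example B back into two single lines. For real documents, the marker pattern would have to be adapted to whatever reliably signals a new line in the source files.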

0 Answers