Java Pdf content to String

Question

I'm wondering if there is a way to obtain the content of a PDF file (raw bytes) as a String using Apache PDFBox 2.0.8. What I'm doing is saving the PDDocument object to a ByteArrayOutputStream and then creating a new String from the ByteArrayOutputStream's byte array. But if I save the String to a file, the result is a blank PDF. The reason is that the bytes in the PDF's stream section differ from those of a PDF saved directly from the PDDocument object to a file. After finding this out, I tried to detect the ByteArrayOutputStream's character encoding using juniversalchardet, but no luck. So, is there a way to accomplish this? This is what I have tried so far:

ByteArrayOutputStream baos = new ByteArrayOutputStream();
PDDocument doc = new PDDocument();
... //Add page, font, pdPageContentStream and text only to doc object with some latin chars (áéíóú)
doc.save(baos);

So, if I create a file using baos object, the pdf file looks as expected, but if I do this:

String str = new String(baos.toByteArray());

And then create a file using str bytes, the pdf file only shows a blank page. Hope I was clear enough this time :)
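
The symptom described above can be reproduced without PDFBox at all. The following is only a minimal sketch: the class name ByteRoundTrip and the sample bytes are made up for illustration, and it assumes the default charset behaves like UTF-8 (the actual default depends on the platform and Java version). It compares a round trip through a UTF-8 String, which replaces invalid byte sequences, with a round trip through ISO-8859-1 or Base64, which both keep every byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class ByteRoundTrip {
    // Decode as UTF-8 and encode again. Byte sequences that are not valid
    // UTF-8 are replaced with U+FFFD, so the original bytes are not recovered.
    static byte[] viaUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    // ISO-8859-1 maps each byte value 0x00..0xFF to exactly one char,
    // so decoding and encoding with the same charset is lossless.
    static byte[] viaLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.ISO_8859_1);
    }

    // Base64 turns arbitrary bytes into plain ASCII text and back.
    static byte[] viaBase64(byte[] raw) {
        String text = Base64.getEncoder().encodeToString(raw);
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Hypothetical sample: a few non-ASCII bytes like those found in
        // compressed PDF streams (in the question this would be baos.toByteArray()).
        byte[] raw = {0x25, (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3, 0x78, (byte) 0x9C};

        System.out.println("UTF-8 round trip intact:      " + Arrays.equals(raw, viaUtf8(raw)));
        System.out.println("ISO-8859-1 round trip intact: " + Arrays.equals(raw, viaLatin1(raw)));
        System.out.println("Base64 round trip intact:     " + Arrays.equals(raw, viaBase64(raw)));
    }
}
```

If the String really has to carry the raw bytes, the same charset must be used on both sides, or the bytes should be Base64-encoded; whether either fits depends on where the String is going.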

Please provide your research work which will help us to help you! — mannedear, Mar 21 '18 at 12:50
What exactly do you mean by *"the content of a pdf file"*? Do you mean the textual content? Or bitmap images? Or vector images? Or or or... — mkl, Mar 21 '18 at 13:59

score 1 · Answer 1 · answered Mar 21 '18 at 13:25

Using this, just append everything to a String.

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

StringBuilder sb = new StringBuilder();
// try-with-resources closes the document when done
try (PDDocument document = PDDocument.load(new File("your\\path\\file.pdf"))) {
    if (!document.isEncrypted()) {
        // Extract the textual content of the whole document
        PDFTextStripper tStripper = new PDFTextStripper();
        String pdfFileInText = tStripper.getText(document);
        // Split on Unix (\n) or Windows (\r\n) line endings
        String[] lines = pdfFileInText.split("\\r?\\n");
        for (String line : lines) {
            // Note: lines are joined without any separator
            sb.append(line);
        }
    }
}
return sb.toString();

achAmháin

Can you explain the idea behind "\\r?\\n"? – TomCold Oct 12 '18 at 02:36
@TomCold it's a regex for splitting the lines - I just took it from the link I provided but you can read more here: https://stackoverflow.com/questions/454908/split-java-string-by-new-line – achAmháin Oct 12 '18 at 08:25
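
To illustrate the regex discussed in the comments above: `\\r?\\n` matches an optional carriage return followed by a line feed, so it splits on both Windows (`\r\n`) and Unix (`\n`) line endings. A small sketch, where the class name LineSplitDemo and the sample text are made up for illustration:

```java
class LineSplitDemo {
    // Split text into lines, accepting either \r\n or \n as the separator.
    static String[] splitLines(String text) {
        return text.split("\\r?\\n");
    }

    public static void main(String[] args) {
        // Sample text mixing a Windows line ending and a Unix one
        String text = "first\r\nsecond\nthird";
        for (String line : splitLines(text)) {
            System.out.println("[" + line + "]");
        }
    }
}
```

Without the optional `\\r?`, splitting on `\\n` alone would leave a trailing carriage return on lines that came from Windows-style text.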
