How to split PDF with PDFBox after random number of pages?

Question

I want to split a large PDF document according to a specific criterion. Each page has the following header:

The code (000084292 in the PDF header image) can change after a random number of pages: a new code may appear after 1 page, or 3, or 10, and so on. Every time I find a new code in the header, I must create a new file whose name is the previous code. For example, if my current code is 000084292 and I find the code 000084345, I must create the file 000084292.pdf and keep reading the original PDF document; if I then find the code 000087209, I must create the file 000084345.pdf, and so on.

This is a scanned PDF, so first convert to images (https://stackoverflow.com/questions/23326562/ ), then use ZXing to get the barcode. To create new PDFs, either use the splitter class, or create a new PDF and add from the existing PDF with addPage(). — Tilman Hausherr, Jul 03 '20 at 10:40
If scanning problems like at the left side can also occur in the rest of the header, be prepared to also be confronted with unreadable barcodes you have to process separately. — mkl, Jul 03 '20 at 11:08
Hi, sorry, you're right. I forgot the previous steps of my program: 1. The customer sends me the PDF file, which is an image. 2. The file is converted to a readable format (with OCR programs). 3. I read the PDF with Java and split it by the code. Normally the code is readable, but I know that sometimes it is not (so I also need to handle this case). I hope it is clearer now. — Francesco Gioli, Jul 03 '20 at 11:44
If it is an OCRed PDF, then extract the text and find out whether there is a reliable way to get these barcodes (e.g. with regular expressions) — Tilman Hausherr, Jul 03 '20 at 12:15
Unfortunately I don't have any control over the image-to-text conversion. If there were a way to convert the image to text from Java, maybe I could split the file directly, but there is not. So I must work directly on the PDF file (already converted to text). — Francesco Gioli, Jul 03 '20 at 12:29
PDFBox can do text extraction (not OCR. I mean the text that is in the PDF after an OCR). Look for "ExtractTextSimple.java" in the source code download. Or run the ExtractText command line application. — Tilman Hausherr, Jul 03 '20 at 13:13
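Once the text of a single page has been extracted, the regular-expression idea from the comments could look roughly like the sketch below. It assumes the code is always a standalone run of exactly nine digits (as in 000084292); the class name HeaderCodeFinder and the method findCode are hypothetical, and the pattern would need adjusting if other nine-digit numbers can appear on a page.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the header code out of the text of ONE page.
final class HeaderCodeFinder {
    // Assumption: the code is exactly nine digits, not embedded in a longer digit run.
    private static final Pattern CODE = Pattern.compile("(?<!\\d)\\d{9}(?!\\d)");

    private HeaderCodeFinder() {}

    /** Returns the first nine-digit code found in the page text, if any. */
    static Optional<String> findCode(String pageText) {
        if (pageText == null) {
            return Optional.empty();
        }
        Matcher m = CODE.matcher(pageText);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }
}
```

With PDFBox, the text of one page can be obtained by restricting a PDFTextStripper to that page with setStartPage and setEndPage (both 1-based) before calling getText on the document. An empty result from findCode marks a page whose code was not readable and needs separate handling.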
Sorry for my English, maybe I didn't explain the problem well. I already have a readable PDF (so my Java program can read the content). While reading the file, whenever I find a new code (in the header of the document) I need to split. But as far as I know, the split function works only on single pages or after a fixed number of pages. I don't know in advance where I must split, because the code can change after one page, or two, or 10. — Francesco Gioli, Jul 03 '20 at 14:25
If you know that pages 10, 11, 12, 13 and 14 are part of a sequence, then you can call PDPage.get() on these and call addPage() to a new PDDocument object. — Tilman Hausherr, Jul 04 '20 at 08:53
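Finding which pages belong to one sequence does not require a fixed split size: one pass over the per-page codes is enough. The following is a minimal sketch, not a tested solution. The names PageRunGrouper, Run and group are hypothetical, and it assumes a page whose code could not be read (null) belongs to the same file as the pages before it, which may not be the right policy for every document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns "code seen on each page" into page ranges, one per output file.
final class PageRunGrouper {

    /** A code plus the inclusive, 0-based page range that carries it. */
    static final class Run {
        final String code;
        final int firstPage;
        final int lastPage;

        Run(String code, int firstPage, int lastPage) {
            this.code = code;
            this.firstPage = firstPage;
            this.lastPage = lastPage;
        }
    }

    private PageRunGrouper() {}

    /**
     * codePerPage.get(i) is the code read from page i, or null if unreadable.
     * Assumption: unreadable pages stay with the current run; unreadable pages
     * before the first code are attached to the first run.
     */
    static List<Run> group(List<String> codePerPage) {
        List<Run> runs = new ArrayList<>();
        String current = null;
        int start = 0;
        for (int i = 0; i < codePerPage.size(); i++) {
            String code = codePerPage.get(i);
            if (code == null || code.equals(current)) {
                continue; // same document continues
            }
            if (current != null) {
                // A new code appeared: close the previous run at the page before it.
                runs.add(new Run(current, start, i - 1));
                start = i;
            }
            current = code;
        }
        if (current != null) {
            runs.add(new Run(current, start, codePerPage.size() - 1));
        }
        return runs;
    }
}
```

Each Run could then be written out as suggested above: create a new PDDocument, call addPage(source.getPage(i)) for every i from firstPage to lastPage, save it as code + ".pdf", and close the source document only after all output files have been saved.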
