2

I need to edit the header information of several PDF files. What I'd like to achieve is to remove all header data before %PDF-X.Y.Z.

What I came up with as a possible solution was to open the PDF in binary mode and read each character until %PDF-X.Y.Z is found, then continue reading the rest of the stream and save it to a new file. I thought that this way I would end up with an exact binary copy of the PDF, just with different header information.

What's the easiest/best way to do this in C? Are there any libraries available that could help me do this? I'm also interested in hearing different approaches to solve this problem.

Thanks.

Dominik

2 Answers

3

Actually, you can throw away everything before the %PDF tag, but doing so invalidates the xref table at the end of the file. This table contains byte-offset references to the PDF objects.

The easiest way is to remove the part before %PDF, count how many bytes you threw away, and reduce the offsets in the xref table accordingly.

p4553d
  • Alternatively just replace the stuff before `%PDF` with spaces! – David Heffernan Mar 11 '11 at 09:04
  • @David Heffernan: Yes, if the amount is not too big, that can be an acceptable, if slightly dirty, solution – p4553d Mar 11 '11 at 09:32
  • and don't forget that PDF can contain multiple xrefs in one file (most linearized PDFs contain more than one) – Bobrovsky Mar 11 '11 at 11:11
  • This is not true. In a PDF, any garbage before the magic PDF header %PDF should not be counted in the xref offsets. So you don't need to care about them: it is safe to simply remove the bytes before %PDF, and the result will normally be a valid PDF file. – PatrickF Sep 20 '17 at 13:58
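The "replace with spaces" idea from the comments above can be sketched roughly as follows. It leaves the file length, and therefore every byte offset, unchanged. This is only a sketch under assumptions: the function name `blank_pdf_preamble` is made up for illustration, the marker searched for is `%PDF-`, and whether a particular PDF reader accepts leading spaces is not something this thread settles.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: overwrites every byte before the first "%PDF-"
 * marker with a space, in place. Returns the number of bytes blanked,
 * or -1 on error (including when no marker is found).
 * Assumes the preamble is small enough for its length to fit in a long. */
long blank_pdf_preamble(const char *path)
{
    static const char marker[] = "%PDF-";
    const size_t mlen = sizeof marker - 1;
    assert(path != NULL);

    FILE *f = fopen(path, "r+b");   /* binary mode, read and write */
    if (f == NULL)
        return -1;

    /* Scan byte by byte. '%' occurs only at the start of the marker, so on
     * a mismatch it is enough to check whether the byte restarts a match. */
    size_t matched = 0;
    long pos = 0;
    int c;
    while (matched < mlen && (c = getc(f)) != EOF) {
        pos++;
        if (c == marker[matched])
            matched++;
        else
            matched = (c == marker[0]) ? 1 : 0;
    }
    if (matched < mlen) {
        fclose(f);
        return -1;
    }

    long preamble = pos - (long)mlen;

    /* A seek is required when switching from reading to writing. */
    int ok = fseek(f, 0L, SEEK_SET) == 0;
    for (long i = 0; ok && i < preamble; i++)
        ok = putc(' ', f) != EOF;

    if (fclose(f) != 0)
        ok = 0;
    return ok ? preamble : -1;
}
```

Because this modifies the file in place, it is probably wise to try it on a copy first.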
2

Assuming that stripping the beginning of the file really does solve your problem, all you need are fopen, fread, fwrite and fclose.

You open the file for reading in binary mode. Read up until you find the magic %PDF string. Open the output file for binary writing. Write out to that file, starting with your new %PDF string. When you are done writing, close both files.
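The steps above might look something like the following sketch. The function name `strip_pdf_preamble` and the 64 KiB copy buffer are assumptions made for illustration, and the marker searched for is `%PDF-`. Only a fixed-size buffer is held in memory, so the size of the PDF should not matter. The sketch does not touch the xref table.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies in_path to out_path, dropping every byte
 * that precedes the first "%PDF-" marker. Returns the number of bytes
 * dropped, or -1 on error (including when no marker is found). */
long long strip_pdf_preamble(const char *in_path, const char *out_path)
{
    static const char marker[] = "%PDF-";
    const size_t mlen = sizeof marker - 1;
    assert(in_path != NULL && out_path != NULL);

    FILE *in = fopen(in_path, "rb");
    if (in == NULL)
        return -1;

    /* Scan byte by byte; stdio buffers the reads internally. '%' occurs
     * only at the start of the marker, so on a mismatch it is enough to
     * check whether the current byte restarts a match. */
    size_t matched = 0;
    long long pos = 0;
    int c;
    while (matched < mlen && (c = getc(in)) != EOF) {
        pos++;
        if (c == marker[matched])
            matched++;
        else
            matched = (c == marker[0]) ? 1 : 0;
    }
    if (matched < mlen) {
        fclose(in);
        return -1;
    }

    FILE *out = fopen(out_path, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    /* Write the marker that was consumed, then copy the rest in chunks. */
    int ok = fwrite(marker, 1, mlen, out) == mlen;
    char buf[65536];
    size_t n;
    while (ok && (n = fread(buf, 1, sizeof buf, in)) > 0)
        ok = fwrite(buf, 1, n, out) == n;
    if (ferror(in))
        ok = 0;

    fclose(in);
    if (fclose(out) != 0)
        ok = 0;
    return ok ? pos - (long long)mlen : -1;
}
```

A call such as `strip_pdf_preamble("in.pdf", "out.pdf")` (file names are placeholders) returns how many bytes were removed, which is the number needed if the xref offsets have to be adjusted afterwards.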

David Heffernan
  • Ok, so I'm on the right track. Once the file is opened in binary mode, what's the best way to read byte after byte with `fread` and compare the value to a character? – Dominik Mar 11 '11 at 09:15
  • The naive approach is to read it into a buffer and then advance through the buffer byte by byte, calling memcmp to check for a match. There's likely to be a library function for this. It would be easier in C++, and trivial in an even higher level language! – David Heffernan Mar 11 '11 at 09:25
  • I see. Reading the whole file into a buffer is probably a little dangerous in my case as the PDF files could be several GB in size. I actually chose C because I thought it's probably the easiest way to manipulate/work with binary data. So you think I'd be better off using for example C# or Python? – Dominik Mar 11 '11 at 09:36
  • You don't need to read the whole file into one buffer in one go. – David Heffernan Mar 11 '11 at 09:39
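One way to apply the `memcmp` approach from the comments above without loading the whole file is to read fixed-size chunks and carry the last few bytes of each chunk over to the next, so that a marker split across a chunk boundary is still found. The sketch below is an illustration under assumptions: the name `find_marker` and the 4096-byte `CHUNK_SIZE` are arbitrary choices, not something taken from the thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE 4096  /* arbitrary; any size >= the marker length works */

/* Hypothetical helper: returns the offset of the first occurrence of
 * marker, counted from the stream's position at the time of the call,
 * or -1 if the marker is not found. */
long long find_marker(FILE *in, const char *marker)
{
    size_t mlen = strlen(marker);
    assert(in != NULL && mlen > 0 && mlen <= CHUNK_SIZE);

    unsigned char buf[2 * CHUNK_SIZE];
    size_t carry = 0;     /* bytes kept from the previous chunk */
    long long base = 0;   /* stream offset corresponding to buf[0] */
    size_t got;

    while ((got = fread(buf + carry, 1, CHUNK_SIZE, in)) > 0) {
        size_t avail = carry + got;

        /* Try every start position where a full marker still fits. */
        for (size_t i = 0; i + mlen <= avail; i++) {
            if (memcmp(buf + i, marker, mlen) == 0)
                return base + (long long)i;
        }

        /* Keep the last mlen - 1 bytes: a marker may begin there and
         * finish in the next chunk. */
        size_t keep = avail < mlen - 1 ? avail : mlen - 1;
        memmove(buf, buf + avail - keep, keep);
        base += (long long)(avail - keep);
        carry = keep;
    }
    return -1;
}
```

Memory use stays at two chunks regardless of file size, so the same code should work for multi-gigabyte inputs.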