1

I would like to delete parts of a binary file using C++. The binary file is about 5-10 MB.

What I would like to do:

  1. Search for an ANSI string "something"
  2. Once I have found this string, I would like to delete the following n bytes, for example the following 1 MB of data. I would like to delete those bytes, not fill them with NULs, thus making the file smaller.
  3. I would like to save the modified file into a new binary file, which is the same as the original file except for the n bytes I have deleted.

Can you give me some advice / best practices on how to do this most efficiently? Should I load the file into memory first?

How can I search efficiently for an ANSI string? I mean, I may have to skip a few megabytes of data before I find that string. (I have been told I should ask this as a separate question, so it is here: How to look for an ANSI string in a binary file?)

How can I delete n bytes and write it out to a new file efficiently?

OK, I don't need it to be super efficient; the file will not be bigger than 10 MB and it's OK if it runs for a few seconds.

hyperknot
  • Do you want portable code? Or would you be happy with a platform-specific solution? – Oliver Charlesworth Jun 22 '11 at 23:24
  • I would prefer it to be a command line application that can be compiled both under Linux and under VS2010. Does it make it harder to compile under both Linux and VS2010? – hyperknot Jun 22 '11 at 23:27
  • If you can live with writing 3 lines of OS-dependent code, I suggest memory mapping. It is awesome for this kind of thing, and you cannot do it any more efficiently. Deleting a range equals a `memcpy` and truncating the file by the length of the removed range. Plus, it's super fast. – Damon Jun 22 '11 at 23:28
  • Or just write the file out in two chunks one before and after the segment you want to remove. – GWW Jun 22 '11 at 23:36
  • This really should be asked as several questions. – Billy ONeal Jun 22 '11 at 23:36
  • OK, now I got the ideas here. I will ask it as more questions. – hyperknot Jun 22 '11 at 23:37
  • I asked the first part here: http://stackoverflow.com/questions/6447819/how-to-look-for-an-ansi-string-in-a-binary-file – hyperknot Jun 22 '11 at 23:41
  • @Damon: it's more likely to need `memmove()` :-). – Tony Delroy Jun 23 '11 at 01:36

3 Answers

1

There are a number of fast string search routines that perform much better than testing each and every character. For example, when trying to find "something" (9 characters long), in the best case only every 9th character needs to be tested.

Here's an example I wrote for an earlier question: code review: finding </body> tag reverse search on a non-null terminated char str
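
To illustrate the "skip ahead" idea, here is a minimal sketch of one such routine, Boyer-Moore-Horspool, working on raw bytes. This is not the code from the linked question; the name `horspool_find` and its signature are made up for this example.

```cpp
#include <array>
#include <cstddef>
#include <cstring>

// Boyer-Moore-Horspool search over raw bytes (embedded NULs are fine).
// Returns the offset of the first match, or hay_len if there is none.
// horspool_find is a hypothetical name used only for this sketch.
std::size_t horspool_find(const unsigned char* hay, std::size_t hay_len,
                          const unsigned char* needle, std::size_t needle_len)
{
    if (needle_len == 0) return 0;
    if (needle_len > hay_len) return hay_len;

    // shift[b]: how far the window may slide when its last byte is b.
    // Bytes that do not occur in the needle allow a full needle_len jump.
    std::array<std::size_t, 256> shift;
    shift.fill(needle_len);
    for (std::size_t i = 0; i + 1 < needle_len; ++i)
        shift[needle[i]] = needle_len - 1 - i;

    std::size_t pos = 0;
    while (pos + needle_len <= hay_len) {
        if (std::memcmp(hay + pos, needle, needle_len) == 0)
            return pos;
        // Slide according to the byte under the end of the window.
        pos += shift[hay[pos + needle_len - 1]];
    }
    return hay_len;
}
```

When the byte under the end of the window does not occur in "something" at all, the window jumps 9 bytes at once, which is where the "every 9th character" figure comes from.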

Ben Voigt
  • But how efficient is it to copy the 5-10 MB file into a string and use string::find? – hyperknot Jun 22 '11 at 23:45
  • @zsero: Depends on your `string::find` implementation I guess. Most are not very efficient. – Ben Voigt Jun 22 '11 at 23:46
  • @Loduwijk: `std::string` is counted, not NUL-terminated. – Ben Voigt Jun 22 '11 at 23:50
  • But every 9th character needs to be tested against multiple things, right? How is that better than testing each character against one thing? – HighCommander4 Jun 22 '11 at 23:50
  • @Ben: do you mean *on average* only every 9th character needs to be tested? Because if not, I think you're wrong. – TonyK Jun 22 '11 at 23:54
  • @Ben: I see now. Very clever! – HighCommander4 Jun 22 '11 at 23:55
  • @TonyK: Once a possible match is found, other characters need to be tested. (But the naive algorithm checks every character more than once when possible matches appear, so the speedup really is about 9x) – Ben Voigt Jun 22 '11 at 23:58
  • @TonyK: I think it's more like only every 9th character is tested as long as it is not a character in the input string. If it is a character in the input string, adjacent characters are tested until a mismatch occurs or the string is matched. – HighCommander4 Jun 22 '11 at 23:59
  • @Ben Voigt: Oh, my bad. I did not realize that. I'll delete all erroneous comments I made to that effect. – Loduwijk Jun 23 '11 at 00:12
0

For a 5-10 MB file I would have a look at writev(), if your system supports it. Read the entire file into memory, since it is small enough. Scan for the bytes you want to drop. Pass writev() the list of iovecs (which will just be pointers into your read buffer, plus lengths) and you can write out the entire modified contents in a single system call.

ribram
0

First, if I understand your meaning in your "How can I search efficiently" subsection, you cannot just skip a few megabytes of data in the search if the target string might be in those first few megabytes.

As for loading the file into memory, if you do that, don't forget to make sure you have enough space in memory for the entire file. You will be frustrated if you go to use your utility and find that the 2GB file you want to use it on can't fit in the 1.5GB of memory you have left.

For the following, I am going to assume you will load the file into memory or memory-map it.

You did specifically say this was a binary file, so you cannot use the NUL-terminated C string functions such as strstr, as the null bytes in the file's data will confuse them (the search ends prematurely without a match). A length-counted std::string does not have this problem. You can instead use memchr to find the first occurrence of the first byte of your target, and memcmp to compare the bytes from there with the bytes in the target; keep using memchr/memcmp pairs to scan through the entire buffer until the target is found. This is not the most efficient way, as there are better pattern-matching algorithms, but it is reasonably efficient.
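
A rough sketch of that memchr/memcmp loop, assuming the data is already in a memory buffer. The function name `find_bytes` is made up for this example.

```cpp
#include <cstddef>
#include <cstring>

// Returns a pointer to the first occurrence of needle in data, or nullptr.
// Works on raw bytes, so embedded NULs are not a problem.
// find_bytes is a hypothetical name used only for this sketch.
const char* find_bytes(const char* data, std::size_t data_len,
                       const char* needle, std::size_t needle_len)
{
    if (needle_len == 0) return data;
    const char* p = data;
    const char* end = data + data_len;
    while (static_cast<std::size_t>(end - p) >= needle_len) {
        // Only consider start positions that leave room for the whole needle.
        std::size_t span = static_cast<std::size_t>(end - p) - needle_len + 1;
        p = static_cast<const char*>(std::memchr(p, needle[0], span));
        if (!p) return nullptr;                   // first byte never occurs
        if (std::memcmp(p, needle, needle_len) == 0) return p;
        ++p;                                      // false alarm, keep scanning
    }
    return nullptr;
}
```

The offset of the match is then `result - data`, and the bytes to delete start at that offset plus the length of the target.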

To "delete" n bytes from a buffer you have to actually move the data that follows those n bytes, copying all of it down to the position where the deleted range started.
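
As a sketch, assuming the file contents sit in a `std::vector<char>`, the move could look like this. `erase_range` is an invented name; `buf.erase(buf.begin() + pos, buf.begin() + pos + n)` achieves the same result with the standard library.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Removes up to n bytes starting at offset pos from buf; returns the new size.
// Hypothetical helper for illustration.
std::size_t erase_range(std::vector<char>& buf, std::size_t pos, std::size_t n)
{
    if (pos >= buf.size()) return buf.size();       // nothing to remove
    if (n > buf.size() - pos) n = buf.size() - pos; // clamp to end of buffer
    std::size_t tail = buf.size() - pos - n;        // bytes after the range
    // Source and destination overlap, so memmove (not memcpy) is required.
    std::memmove(buf.data() + pos, buf.data() + pos + n, tail);
    buf.resize(buf.size() - n);
    return buf.size();
}
```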

If you have actually copied the data from disk into memory, it will be faster to manipulate it there and write the result to the new file. Otherwise, once you have found the offset X in the first file where the deletion should start, open a new file for writing, read the first X bytes from the first file and write them straight into the second file. Then seek to X+n in the first file and copy everything from there to the end of file1, appending it to what you have already put into file2.
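
The second (streaming) variant might be sketched with standard iostreams as follows, which should compile under both Linux and Visual Studio. The function name `copy_without_range`, the 64 KB chunk size and the parameter names are assumptions made for this example.

```cpp
#include <algorithm>
#include <fstream>
#include <string>

// Copies in_path to out_path, leaving out n bytes starting at offset x.
// Hypothetical helper for illustration; the chunk size is arbitrary.
bool copy_without_range(const std::string& in_path, const std::string& out_path,
                        std::streamoff x, std::streamoff n)
{
    std::ifstream in(in_path.c_str(), std::ios::binary);
    std::ofstream out(out_path.c_str(), std::ios::binary | std::ios::trunc);
    if (!in || !out) return false;

    char chunk[64 * 1024];
    const std::streamoff chunk_size = sizeof chunk;

    // 1. Copy the first x bytes unchanged.
    std::streamoff left = x;
    while (left > 0) {
        std::streamsize want =
            static_cast<std::streamsize>(std::min(left, chunk_size));
        in.read(chunk, want);
        std::streamsize got = in.gcount();
        if (got <= 0) break;            // input is shorter than x
        out.write(chunk, got);
        left -= got;
    }

    if (left == 0) {
        // 2. Skip the n bytes that are being deleted.
        in.seekg(x + n, std::ios::beg);
        // 3. Append everything from x + n to the end of the input.
        while (in.read(chunk, sizeof chunk) || in.gcount() > 0)
            out.write(chunk, in.gcount());
    }
    out.flush();
    return out.good();
}
```

This never holds more than one chunk in memory, so it also works for files much larger than 10 MB.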

Loduwijk