3

I have a set of fairly large files (about 50 megabytes each, and at least a hundred of them), but I need to insert a small header (about 2 dozen lines) onto each one for processing purposes. I was hoping to write a script in either bash or python to do it, but I can't find a constant-time function that will let me insert at the front of a text file. If it is not constant time, I think it'll take too long to complete. Does anyone have experience with this issue?

  • 4
    No, not in constant time. You can't prepend data to a file without rewriting the whole file. This is a fundamental limitation of filesystems, not programming languages. – Chewie May 20 '13 at 14:19
  • 2
    ...but if you later plan to `open()` the files in Python (or some other language) to "process" them, you can fake up a file-like object which 'pretends' they have the "small header" prepended to them, rather than modifying the actual files, assuming that's an acceptable alternative. – Aya May 20 '13 at 14:32
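A minimal sketch of that idea, assuming the later processing only needs to iterate over lines. The helper name `lines_with_header` is hypothetical, and it returns a line iterator rather than a full file-like object:

```python
import itertools


def lines_with_header(header_path, data_path):
    """Yield the header lines, then the data lines, without modifying either file."""
    with open(header_path, "r") as header, open(data_path, "r") as data:
        # chain() pulls lines lazily, so memory use does not grow with the data file.
        yield from itertools.chain(header, data)
```

A consumer would then loop over `lines_with_header("header.txt", "content.txt")` instead of over an opened file; nothing is rewritten on disk.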

3 Answers

4

Similar to Uwe's answer, but if your processing tool can only accept parameters as filenames, you can fake one up with mkfifo(1).

For example, in bash...

echo 'My header' > header.txt
echo 'My content' > content.txt
mkfifo fakefile.txt
cat header.txt content.txt > fakefile.txt &
cat fakefile.txt

...would stream the contents of the two files, rather than creating a new file.
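The same trick can probably be driven from Python on a POSIX system. The sketch below is an illustration, not a tested production recipe: `serve_through_fifo` is a hypothetical helper name, and the background thread plays the role of the backgrounded `cat`.

```python
import os
import shutil
import threading


def serve_through_fifo(fifo_path, *source_paths):
    """Create a named pipe and stream the source files into it from a thread."""
    os.mkfifo(fifo_path)  # POSIX only

    def writer():
        # Opening a FIFO for writing blocks until some reader opens it.
        with open(fifo_path, "wb") as fifo:
            for path in source_paths:
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, fifo)

    thread = threading.Thread(target=writer, daemon=True)
    thread.start()
    return thread
```

After `serve_through_fifo("fakefile.txt", "header.txt", "content.txt")`, a tool that is handed `fakefile.txt` as a filename reads the header followed by the content. The FIFO can be read only once, so it has to be set up again for each run.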

Aya
3

You cannot insert text into a Unix file in constant time, either at the beginning or in the middle. On the other hand, depending on the processing that you have in mind, you may be able to avoid the insertion completely. This works if your processing tool is able to read from a pipe. Then you could do something like

cat headerfile datafile | myprocessingtool

so that the data file is not actually modified.
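If the job is driven from Python rather than the shell, the same pipeline could be sketched with `subprocess`. Here `run_with_header` is a hypothetical helper, and `command` stands for whatever tool does the processing (the `myprocessingtool` placeholder above):

```python
import shutil
import subprocess


def run_with_header(command, header_path, data_path, stdout=None):
    """Feed header + data to the command's stdin; return its exit code."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=stdout)
    try:
        for path in (header_path, data_path):
            with open(path, "rb") as src:
                # Copy in chunks so the 50 MB file is never held in memory at once.
                shutil.copyfileobj(src, proc.stdin)
    finally:
        proc.stdin.close()  # signals end of input to the tool
    return proc.wait()
```

Passing an open file as `stdout` sends the tool's output there; leaving it as `None` lets the output go to the terminal, as in the shell version.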

Uwe
  • For clarification, when I said constant time, I did mean constant time with respect to the modified file. Obviously it must be linear with respect to the text being inserted. Is that still not possible? – user2401982 May 20 '13 at 15:08
  • 2
    @user2401982 When appending to a file, the time taken (at least in theory) is proportional to the size of the new data you're appending, but if you're prepending or inserting, it's proportional to the sum of the sizes of both the original file and the new data. – Aya May 20 '13 at 15:19
2

I believe this is close to the best you will do (bash):

MYHEADER=/path/to/the/header
HEADERSIZE=$(stat --format %s "$MYHEADER")   # GNU stat: size in bytes

# FILES is assumed to hold a whitespace-separated list of the files to process.
for FILENAME in $FILES; do
    OLDSIZE=$(stat --format %s "$FILENAME")
    cat "$MYHEADER" "$FILENAME" > /tmp/headerize.tmp
    NEWSIZE=$(stat --format %s /tmp/headerize.tmp)
    EXPECTEDSIZE=$((HEADERSIZE + OLDSIZE))
    # Replace the original only if the temporary file has exactly the expected size.
    if [ "$NEWSIZE" -eq "$EXPECTEDSIZE" ]; then
      mv /tmp/headerize.tmp "$FILENAME"
    else
      echo "Something odd happened when processing $FILENAME, headerization skipped for this file."
    fi
done

Unless you have a seriously pathetic system or way too high standards for how long is too long, this should complete in decent time. And it includes error checking. You should make sure your header ends in a newline, of course, otherwise the final header line and the first textfile line will get merged.

The only remaining optimization here is to ensure that the temporary file is written to the same filesystem as the original files; the mv command is then a simple rename rather than a second copy of the data.

In general, content insertion is slow. This is true whether it's in-memory or on-disk. I believe that you will never find a constant-time solution. But you probably don't actually need one for a one-time batch job.

This is IMO the fastest implementation you can do in Python. Since it doesn't create a temporary file, it may be faster than the bash version:

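MYHEADERPATH = '/path/to/the/header'
# `files` is assumed to be an iterable of the paths to process.
with open(MYHEADERPATH, 'r') as f:
    header = f.read()
for filename in files:
    # Read the whole file into memory, then overwrite it in place.
    with open(filename, 'r') as f:
        content = f.read()
    with open(filename, 'w') as f:
        f.write(header + content)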

However if you wanted it to be strictly safe, you would have to do it the same way as the bash script, so there might be little speed difference ultimately.
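A possible sketch of such a stricter variant, under a few assumptions: `prepend_header` is a hypothetical name, the temporary file is deliberately created next to the target so that the final rename stays on one filesystem, and the size check mirrors the one in the bash script.

```python
import os
import shutil
import tempfile


def prepend_header(header_path, target_path):
    """Write header + target to a temp file in the same directory, then swap it in."""
    target_dir = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, prefix=".headerize-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for path in (header_path, target_path):
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, tmp)  # chunked copy, low memory use
        expected = os.path.getsize(header_path) + os.path.getsize(target_path)
        if os.path.getsize(tmp_path) != expected:
            raise OSError("unexpected size for %s" % tmp_path)
        shutil.copymode(target_path, tmp_path)  # keep the original permission bits
        os.replace(tmp_path, target_path)       # rename over the original
    except BaseException:
        # Leave the original untouched and clean up the partial temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Calling `prepend_header(MYHEADERPATH, filename)` inside the loop would replace the read-then-overwrite step. It still copies every byte of each file once, so the total work stays linear in the file size.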

kampu
  • 1
    The `OLDSIZE`/`NEWSIZE` check seems unnecessary; you don't do anything similar in the Python version. There is no (meaningful) performance difference in writing to a temporary file and renaming versus overwriting an existing file. – chepner May 20 '13 at 14:24
  • @chepner: True IFF the temporary file is on the same filesystem (`/tmp` is often on its own filesystem; it certainly is here, on Arch Linux). I intentionally made the versions different because of this. Also, the cat command can potentially fail (out of space) and you don't want to overwrite with a truncated version. – kampu May 20 '13 at 14:30
  • @chepner: I started thinking about that, and concluded that short of making the python version mimic the bash version, my only recourse would be df, which is ridiculously involved for such a tiny script :) And the bash script is not totally safe: partial truncation (where n_truncated_bytes < header_size) could occur. To do it properly I guess I'd need to measure the header size and check for exact expected-size values. EDIT: done :) – kampu May 20 '13 at 14:41