
I'm writing a Python backend for Resumable.js, which allows uploading large files from a browser by splitting them into smaller chunks on the client.

Once the server has finished saving all chunks into a temporary folder, it needs to combine them. Individual chunks are small binary files (1 MB by default), but their combined size may exceed the web server's available memory.

How would you do the combining step in Python? Say a folder contains only n files, named "1", "2", "3"...

Can you explain how:

  • read()
  • write(.., 'wb')
  • write(.., 'ab')
  • shutil.copyfileobj()
  • mmap

would work in this case, and which would be the recommended solution given these memory constraints?

hyperknot

2 Answers


Sticking to a purely Pythonic solution (I assume you have your reasons for not going with 'cat' on Linux or 'copy' on Windows):

import shutil

# filepaths: the chunk files in upload order, e.g. ["1", "2", ..., "n"]
with open('out_bin', 'wb') as wfd:
    for f in filepaths:
        with open(f, 'rb') as fd:
            # Copy in 1 MB blocks so memory use stays bounded.
            shutil.copyfileobj(fd, wfd, 1024 * 1024)

will get the job done reliably and efficiently.

The key point is reading and writing in binary mode ('rb', 'wb'), which avoids unsolicited newline conversions that would otherwise corrupt the final binary.

If you are looking for the fastest approach, you will need to benchmark against the other methods you mentioned; there is no guarantee that the winner of such a benchmark would not be somewhat OS-dependent.
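For comparison, here is a minimal sketch of the append-based variant the question asks about (read() in fixed-size blocks plus 'ab'). The folder name, the chunk names "1"…"n" and the 1 MB block size are assumptions taken from the question, not part of this answer:

import os

CHUNK_DIR = 'tmp_chunks'   # assumed temporary folder holding the uploaded chunks
BLOCK = 1024 * 1024        # read/write in 1 MB blocks so memory use stays bounded

# Chunks are assumed to be named "1", "2", ..., "n" as in the question.
for name in sorted(os.listdir(CHUNK_DIR), key=int):
    with open(os.path.join(CHUNK_DIR, name), 'rb') as src, \
         open('out_bin', 'ab') as dst:   # 'ab' appends each chunk to the output
        while True:
            block = src.read(BLOCK)      # bounded read(), never the whole file
            if not block:
                break
            dst.write(block)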

David Simic

Think outside the box. The easiest way to do this in a Unix-esque environment is something like:

cat file1 file2 file3 file4 > output

No need to read the files directly. On Windows it would be

C:\> copy /b file1+file2+file3+file4 output

To do this from Python, there is a great post on how to run command-line programs in Linux.
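If you invoke this from Python on a Unix-like system, a minimal sketch using the standard subprocess module could look like the following; the folder and output names are placeholders, not part of the original answer:

import os
import subprocess

CHUNK_DIR = 'tmp_chunks'   # assumed folder holding chunks named "1", "2", ..., "n"
chunks = sorted(os.listdir(CHUNK_DIR), key=int)

with open('output', 'wb') as out:
    # Equivalent of: cat 1 2 3 ... > output
    subprocess.check_call(['cat'] + chunks, cwd=CHUNK_DIR, stdout=out)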

stdunbar