1

There are a lot of binary diff tools out there:

and so on. They are great, but single-threaded. Is it possible to split large files into chunks, compute the diffs between chunks concurrently, and then merge them into a final delta? Are there any other tools or libraries that can find the delta between very large files (hundreds of GB) in a reasonable amount of time and RAM? Maybe I could implement the algorithm myself, but I cannot find any papers about it.

m9_psy
  • Looks problematic taking context into account. – SergeyA Sep 10 '15 at 22:15
  • @SergeyA: there is proprietary software (http://www.pocketsoft.com/rtpatch_binary_diff_multicore.html) that promises multithreading. If they somehow achieve this, maybe there are some academic papers or libraries? – m9_psy Sep 10 '15 at 22:23

2 Answers

2

ECMerge is multi-threaded and able to compare huge files.

pi3
0

libraries that can find the delta between very large files (hundreds of GB) in a reasonable amount of time and RAM?

Try HDiffPatch; it has been used on a 50 GB game (not tested at 100 GB): https://github.com/sisong/HDiffPatch
It runs fast on large files, but it is not a multi-threaded differ.
Creating a patch: hdiffz -s-1k -c-zlib old_path new_path out_delta_file
Applying a patch: hpatchz old_path delta_file out_new_path
Diffing 100 GB input files with -s-1k requires about 100 GB * 16 / 1k < 2 GB of memory; diffing with -s-128k takes less time and less memory.
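The memory estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic quoted in the answer (file size * 16 / block size); the function name and the reading of "16" as bytes of bookkeeping per block are assumptions made for illustration, not something taken from HDiffPatch itself.

```python
KIB = 1024
GIB = 1024 ** 3

def estimate_diff_memory_bytes(file_size_bytes, block_size_bytes, bytes_per_block=16):
    """Hypothetical helper: apply the rule of thumb size * 16 / block_size."""
    return file_size_bytes * bytes_per_block // block_size_bytes

for block in (1 * KIB, 128 * KIB):
    estimate = estimate_diff_memory_bytes(100 * GIB, block)
    # Report the estimate in MiB for a 100 GiB input
    print(f"block={block // KIB}k -> {estimate / (1024 ** 2):.1f} MiB")
```

Under these assumptions a 1k block size stays below 2 GB, and a 128k block size shrinks the estimate by a factor of 128.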

bsdiff can be changed into a multi-threaded differ:

  • The suffix array sort algorithm can be replaced by msufsort, a multi-threaded suffix array construction algorithm.
  • The match function can be changed to a multi-threaded version that splits the new file by the number of threads.
  • The bzip2 compressor can be replaced by a multi-threaded one, such as pbzip2 or lzma2 ...

but this approach needs a very large amount of memory (not suitable for large files)!
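To illustrate the idea of splitting the new file by the number of threads, here is a minimal Python sketch. It is not bsdiff: instead of a suffix array it uses a simple block-hash index of the old data, and all names (BLOCK, build_index, diff_slice, parallel_diff, apply_patch) are hypothetical. Python threads also will not speed up CPU-bound work because of the GIL, so this only shows the structure: one shared read-only index, one worker per slice of the new data, and results concatenated in slice order.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4  # toy block size; real tools use much larger blocks

def build_index(old: bytes) -> dict:
    """Map each block-aligned BLOCK-byte piece of old to its first offset."""
    index = {}
    for off in range(0, len(old) - BLOCK + 1, BLOCK):
        index.setdefault(old[off:off + BLOCK], off)
    return index

def diff_slice(index: dict, new: bytes, start: int, end: int) -> list:
    """Emit ('copy', old_offset, length) or ('add', data) ops for new[start:end]."""
    ops, pos, pending = [], start, bytearray()
    while pos < end:
        # Only look up full blocks that lie entirely inside this slice
        off = index.get(new[pos:pos + BLOCK]) if pos + BLOCK <= end else None
        if off is not None:
            if pending:
                ops.append(("add", bytes(pending)))
                pending = bytearray()
            ops.append(("copy", off, BLOCK))
            pos += BLOCK
        else:
            pending.append(new[pos])
            pos += 1
    if pending:
        ops.append(("add", bytes(pending)))
    return ops

def parallel_diff(old: bytes, new: bytes, workers: int = 4) -> list:
    index = build_index(old)  # read-only after construction, so safe to share
    step = max(1, -(-len(new) // workers))  # ceiling division
    bounds = [(s, min(s + step, len(new))) for s in range(0, len(new), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves slice order, so the ops can simply be concatenated
        parts = pool.map(lambda b: diff_slice(index, new, *b), bounds)
        return [op for part in parts for op in part]

def apply_patch(old: bytes, ops: list) -> bytes:
    """Rebuild the new data from old plus the list of ops."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[1] + op[2]]
        else:
            out += op[1]
    return bytes(out)
```

One cost of this layout is that matches straddling a slice boundary are missed, so the delta can be slightly larger than a single-threaded one. The index over the old data must also fit in memory, which is the limitation noted above.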

sisong
  • Yes, you are right; I have updated the answer. I think the asker's main purpose is to create a patch between large files and get the result within limited time and memory. Multi-threading is one direction, but I have not seen a similar solution. – sisong Jun 25 '20 at 09:36