I saw a few posts about finding duplicates in a directory and compressing them: here, here, and here, but none of these posts explains how to reverse the process.
After that process you end up with hashes, the bytes they correspond to, and the positions where they occur. In my case, I'm using the algorithm to find duplicate chunks within a single file: a 6 kB file, say, is reduced to about 2 kB. When I try to reconstruct the file, the contents look the same, but the file size differs and a byte-for-byte comparison fails.
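Roughly, the deduplication step being reversed looks like the sketch below (a minimal sketch; the 5-byte chunk size and MD5 hashing are placeholders chosen to match the dict shown further down):

import hashlib
import pickle

CHUNK_SIZE = 5  # placeholder; matches the 5-byte chunks in the dict below

def deduplicate(original, compressed, chunk_size=CHUNK_SIZE):
    a_dict = {}
    with open(original, 'rb') as src:
        pos = 0
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            key = hashlib.md5(chunk).hexdigest()
            # first element is the chunk bytes, the rest are the
            # positions where that chunk occurs
            a_dict.setdefault(key, [chunk]).append(pos)
            pos += 1
    with open(compressed, 'wb') as dst:
        pickle.dump(a_dict, dst, protocol=pickle.HIGHEST_PROTOCOL)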
Here's my code to reconstruct the file, adapted from a previous post:

import pickle
import hashlib

# 'compressed' and 'recovered' are file paths defined earlier in the script
with open(compressed, 'rb') as f, open(recovered, 'wb') as fname:
    a_dict = pickle.load(f)  # load the deduplicated data
    a_list = []
    for values_list in a_dict.values():
        file_bytes = values_list.pop(0)   # first element: the chunk bytes
        for val in values_list:           # remaining elements: positions
            a_list.insert(val, file_bytes)
    result = tuple(a_list)
    pickle.dump(result, fname, protocol=pickle.HIGHEST_PROTOCOL)
where a_dict is

a_dict = {'8a50b9f75b57104d89b58305d96045df': [b'\x94*\x08\x9d\xd8', 0, 1, 4, 6, 7],
          'bff92f621cc65e2103305343a943c9a8': [b'\x85*\xe4\xf0\xd7', 2, 3, 5, 8, 9]}
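In other words, positions 0, 1, 4, 6, 7 all hold the first 5-byte chunk and positions 2, 3, 5, 8, 9 hold the second, so what I expect to get back is the chunks written out again in position order, something like this (reusing a_dict from above; joining the chunks into one raw byte string is my assumption about how the data should end up on disk):

chunks = {}
for values_list in a_dict.values():
    file_bytes = values_list[0]          # the chunk bytes
    for pos in values_list[1:]:          # every position that chunk occupies
        chunks[pos] = file_bytes
original_bytes = b''.join(chunks[pos] for pos in sorted(chunks))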
Again, the contents of the original file and the reconstructed result look the same when I inspect them. But when I compare the two files with Unix cmp file1 file2, or when I hash both files again, the bytes do not match.
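For completeness, this is roughly how I'm hashing the two files for comparison (MD5 here is just the hash I happen to use):

import hashlib

def file_md5(path):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(8192), b''):
            h.update(block)
    return h.hexdigest()

print(file_md5('file1'))  # original
print(file_md5('file2'))  # reconstructed -- the digests differ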