
I saw a few posts about finding duplicates in a directory and compressing them: here, here, and here, but none of these posts explain how to reverse the process.

So after this process, you end up with the block hashes, the block bytes themselves, and the positions where they occur. In my case, I'm using the algorithm to find duplicate blocks within a single file. From a, say, 6 kB file, I get down to about 2 kB. When I try to reconstruct the file, the contents look the same, but the file size differs and a byte-for-byte comparison fails.

Here's my code to reconstruct the file, which I modified from a previous post:

import pickle
import hashlib

with open(compressed, 'rb') as f, open(recovered, "wb") as fname:
    a_dict = pickle.load(f)  # loads the compressed file (hash -> [block bytes, positions...])
    a_list = []
    for values_list in a_dict.values():
        file_bytes = values_list.pop(0)   # first element is the block's bytes
        for val in values_list:           # remaining elements are the block positions
            a_list.insert(val, file_bytes)
        result = tuple(a_list)
        pickle.dump(result, fname, protocol=pickle.HIGHEST_PROTOCOL)

Where a_dict is

a_dict = 
{'8a50b9f75b57104d89b58305d96045df':[b'\x94*\x08\x9d\xd8', 0, 1, 4, 6, 7],
 'bff92f621cc65e2103305343a943c9a8':[b'\x85*\xe4\xf0\xd7', 2, 3, 5, 8, 9]}
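For reference, a minimal sketch of the kind of block-level deduplication that produces a dict of this shape (the 5-byte block size matches the example above, but the variable names and choice of md5 are placeholders, not my exact code):

import pickle
import hashlib

BLOCK_SIZE = 5  # hypothetical: the example blocks above happen to be 5 bytes long

with open(original, 'rb') as f_in, open(compressed, 'wb') as f_out:
    a_dict = {}
    position = 0
    while True:
        block = f_in.read(BLOCK_SIZE)
        if not block:
            break
        key = hashlib.md5(block).hexdigest()
        # the first entry for a hash is the block's bytes; later entries are positions
        a_dict.setdefault(key, [block]).append(position)
        position += 1
    pickle.dump(a_dict, f_out, protocol=pickle.HIGHEST_PROTOCOL)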

Again, the contents of the original file and the reconstructed result look the same. But when I compare them with Unix cmp file1 file2, or even when I hash both files again, the bytes are not the same.
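A quick way to hash both files for comparison (the paths are placeholders):

import hashlib

def file_digest(path):
    # hash the whole file at once; fine for files this small
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

print(file_digest('original.bin'), file_digest('recovered.bin'))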

  • Can you try to pickle-load both files and compare those? That would be clearer. Or use `json` to dump your data instead of `pickle` so you can see _why_ they differ. – Jean-François Fabre Aug 05 '18 at 19:40
  • To debug this, create a file 'aba.txt' where each line is 1023 'a' (or 'b') characters followed by a newline (a small generator sketch follows these comments). Then use `diff -u aba.txt reconstructed.txt` to understand what went south. Also, you mentioned the sizes differ. By how much? Is there maybe some LF vs CRLF trouble, or does one file perhaps start with a Unicode BOM (byte order mark)? Use `hexdump -C` to verify the details. – J_H Aug 05 '18 at 20:09
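Following the comment above, a tiny sketch that builds such an 'aba.txt' test file (the mix of 'a' and 'b' lines is arbitrary):

# each line: 1023 identical characters followed by a newline
with open('aba.txt', 'w') as f:
    for ch in 'aba':
        f.write(ch * 1023 + '\n')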

1 Answer


when you're doing:

for values_list in a_dict.values():

the values are iterated over, but the order can change between runs. That's a consequence of hash randomization, a security feature of Python 3's hash function (read here: Why is dictionary ordering non-deterministic?), which changes from run to run unless you pin it with an environment variable (PYTHONHASHSEED).

So your resulting list data (in a_list) is the same, but in a different order. I suggest sorting the values as you iterate to make the order deterministic:

for values_list in sorted(a_dict.values()):

For native Python structures like yours, I'd recommend using json to serialize them. You could have spotted the issue yourself, seeing that the contents were fine, just in a different order.
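Putting it together, a sketch of the reconstruction loop with the sorted iteration applied (I've also moved the dump outside the loop on the assumption that you want a single pickle in the output; adjust if that's not the case):

import pickle

with open(compressed, 'rb') as f, open(recovered, 'wb') as fname:
    a_dict = pickle.load(f)
    a_list = []
    # sorted() makes the iteration order independent of hash randomization
    for values_list in sorted(a_dict.values()):
        file_bytes = values_list.pop(0)
        for val in values_list:
            a_list.insert(val, file_bytes)
    pickle.dump(tuple(a_list), fname, protocol=pickle.HIGHEST_PROTOCOL)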

Jean-François Fabre
  • I printed the `original` and `output` and then used a diff checker to confirm that they are the same. I don't think the issue is with the dictionary, though, as I'm reconstructing based on the list values. So, values `0` and `1` in the list return two identical blocks, followed by `2` and `3`, which return two other identical blocks (different from the previous ones). As for `json`, I had issues since I'm dealing with bytes. –  Aug 05 '18 at 20:20