Compare two big text files and write only the unique lines from first text file

Question

I have two big text files and want to compare them, keeping only the lines that file1 has and file2 does not. For example:

File1

hello
cat
dog
human

File2

hello
human

Output file

cat
dog

There are some threads about this topic on this forum, but none of them work for big files. If this can't be done in Python, please suggest something else.

How big is the file that we are talking about? – andondraif Oct 08 '20 at 00:15

Snavy · Answer 1 · 2020-10-08T00:55:33.597

0

A question of the same kind has already been answered here.

You just have to modify it a little so that it outputs the differences instead of the similarities:

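# Lines present in a.txt but absent from b.txt (order is not preserved)
with open('a.txt', 'r') as fa, open('b.txt', 'r') as fb:
    diff = set(fa).difference(fb)

# Drop the blank-line entry, if there is one
diff.discard('\n')

with open('d.txt', 'w') as fc:
    fc.write(''.join(diff))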

Tell me if it works on your side. On an average laptop it took me 15.16 seconds to compare two text files of 276,824,064 bytes (285.2 MB) each, a total of 16,777,216 lines of 32 characters each. Files of a few gigabytes should therefore probably be feasible within minutes.
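If memory becomes the limit, one possible variation is to keep only the second file in a set and stream the first file line by line. This is a minimal sketch, not part of the original snippet: the function name `write_unique_lines` is made up for illustration, and it assumes the second file fits in memory.

```python
def write_unique_lines(first_path, second_path, out_path):
    """Write the lines of first_path that never appear in second_path."""
    # Only the second file is held in memory, as a set of its lines.
    with open(second_path, 'r') as fb:
        seen = {line.rstrip('\n') for line in fb}

    # Stream the first file one line at a time; input order is preserved.
    with open(first_path, 'r') as fa, open(out_path, 'w') as fc:
        for line in fa:
            text = line.rstrip('\n')
            if text and text not in seen:  # skip blank lines and shared lines
                fc.write(text + '\n')
```

Calling `write_unique_lines('a.txt', 'b.txt', 'd.txt')` on the example from the question would write `cat` and `dog`. Unlike the set version, a line repeated in the first file is written every time it occurs.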

Here is my random text file generator program:

import string
import random

def genWord(length: int = 32) -> str:
    return ''.join([random.choice(string.ascii_letters) for _ in range(length)])

def main() -> None:
    for letter in ['a', 'b']:
        filename = letter + '.txt'
        # Opening in write mode empties the file, so no separate append pass is needed
        with open(filename, 'w') as file:
            for _ in range(1024 * 1024 * 8):  # Number of lines
                file.write(genWord(32) + '\n')  # Number of letters for each pseudo-word

if __name__ == "__main__":
    main()

edited Oct 08 '20 at 00:55

answered Oct 08 '20 at 00:24

Snavy


It's not working for me, I'm getting a MemoryError. – Actualbury Oct 08 '20 at 02:14
How big are your files? How much memory do you have? – Snavy Oct 08 '20 at 13:12
One txt file is 4 GB and the other one is 450 MB, but they can be even bigger later. I have 16 GB of RAM. – Actualbury Oct 08 '20 at 16:19
What version of Python are you using? Sounds like the 32-bit limit. – Snavy Oct 08 '20 at 16:39
I'm using "Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:43:08) [MSC v.1926 32 bit (Intel)] on win32". – Actualbury Oct 08 '20 at 16:53
You're using a 32-bit build of Python. Uninstall it and install a 64-bit version; that should resolve the MemoryError. – Snavy Oct 09 '20 at 01:46
I installed the 64-bit version and tried the script, but even with 200 MB and 95 MB text files it took very long. Task Manager showed cmd using 99% of RAM, so it was working on something. I waited 15 minutes and then cancelled it, because you said it took 15 seconds for you with 200 MB each. And what if I want to compare text files of 20 GB each? I don't think this script will work for that. – Actualbury Oct 09 '20 at 15:09
Add me on Discord. Let's figure this out Snavy#2853 – Snavy Oct 09 '20 at 21:16
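
For files that do not fit in memory at all (the 20 GB case raised above), one commonly used idea is to split both files into smaller bucket files by a hash of each line, then compare the buckets pairwise. The following is only a sketch under stated assumptions, not something tested at that scale: `partitioned_difference` and the default of 64 buckets are hypothetical choices, and it needs free temporary disk space of roughly the combined size of both inputs.

```python
import os
import tempfile
import zlib


def partitioned_difference(first_path, second_path, out_path, buckets=64):
    """Write lines found in first_path but not in second_path.

    Both inputs are split into `buckets` temporary files by a hash of each
    line, so identical lines always land in the same bucket number. Each
    bucket pair is then compared with an ordinary in-memory set.
    """
    with tempfile.TemporaryDirectory() as tmp:
        # Pass 1: scatter the lines of each input into its bucket files.
        for tag, path in (('a', first_path), ('b', second_path)):
            parts = [open(os.path.join(tmp, f'{tag}{i}'), 'wb')
                     for i in range(buckets)]
            try:
                with open(path, 'rb') as src:
                    for line in src:
                        line = line.rstrip(b'\r\n')
                        if line:  # ignore blank lines
                            index = zlib.crc32(line) % buckets
                            parts[index].write(line + b'\n')
            finally:
                for part in parts:
                    part.close()

        # Pass 2: for each bucket, keep the lines of 'a' missing from 'b'.
        with open(out_path, 'wb') as out:
            for i in range(buckets):
                with open(os.path.join(tmp, f'b{i}'), 'rb') as fb:
                    seen = set(fb)
                with open(os.path.join(tmp, f'a{i}'), 'rb') as fa:
                    for line in fa:
                        if line not in seen:
                            out.write(line)
```

The output is grouped by bucket rather than kept in the original order, and repeated lines from the first file are kept. Raising `buckets` lowers the memory needed per bucket at the cost of more temporary files.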
