I have tried searching for an answer online but unfortunately without success. Therefore I am asking here:
I am trying to figure out if all lines in file1 are present in file2. Luckily I can just compare entire lines rather than individual words etc. Unluckily I am dealing with GB files, so a few of the elementary solutions that I have tried gave me memory errors.
At the moment I have the following code which does not work. Some guidance would be much appreciated.
import mmap

# Checks if all lines in file1 are present in file2
def isFile1SubsetOfFile2(file1, file2):
    with open(file1, "r") as f1, open(file2, "rb") as f2:
        # Map file2 once, read-only, instead of reopening it for every line
        with mmap.mmap(f2.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for line1 in f1:
                # find() returns -1 when the bytes do not occur anywhere in file2
                if mm.find(line1.strip().encode()) == -1:
                    return False
    return True
Sample file2:
This is line1.
This is line2.
This is line3.
This is line4.
This is line5.
This is line6.
This is line7.
This is line8.
This is line9.
Should pass if e.g. file1 is:
This is line4.
This is line5.
Should fail if e.g. file1 is:
This is line4.
This is line10.
Edit: I have just added a working version of my code for others' benefit. No memory errors, but it's very slow.
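One direction that might avoid rescanning file2 for every line, given here as an unbenchmarked sketch only: read file1 into a set, then stream file2 once and tick lines off as they appear. This assumes the distinct lines of file1 fit in memory (file2 is never loaded whole) and that blank lines can be ignored. The name `is_subset_single_pass` is made up for illustration.

```python
def is_subset_single_pass(file1, file2):
    """Return True if every non-blank line of file1 occurs as a whole line in file2."""
    # Collect the distinct lines of file1 (assumption: these fit in memory).
    with open(file1, "rb") as f1:
        missing = {line.strip() for line in f1}
    # Assumption: blank lines are not worth checking, so drop them.
    missing.discard(b"")
    if not missing:
        return True
    # Stream file2 once, removing each wanted line as soon as it is seen.
    with open(file2, "rb") as f2:
        for line in f2:
            missing.discard(line.strip())
            if not missing:
                return True  # everything was found; stop reading early
    return False
```

Unlike a byte search over the mapped file, this compares whole (stripped) lines, so a line of file1 that is only a fragment of a longer line in file2 would not count as present.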