1

I have tried searching for an answer online but unfortunately without success. Therefore I am asking here:

I am trying to figure out if all lines in file1 are present in file2. Luckily I can just compare entire lines rather than individual words etc. Unluckily I am dealing with GB files so a few of the elementary solutions that I have tried gave me memory errors.

At the moment I have the following code which does not work. Some guidance would be much appreciated.

# Checks if all lines in file1 are present in file2
def isFile1SubsetOfFile2(file1 , file2):
    file1 = open(file1, "r")


    for line1 in file1:        
        with open(file2, "r+b") as f:

            mm=mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) 
            my_str_as_bytes = str.encode(line1)
            result = mm.find(line1.strip().encode())
            print(result)
            if result == -1:
                return False
    return True

Sample file2:

This is line1.
This is line2.
This is line3.
This is line4.
This is line5.
This is line6.
This is line7.
This is line8.
This is line9.

Should pass if e.g. file1 is:

This is line4.
This is line5.

Should fail if e.g. file1 is:

This is line4.
This is line10.

Edit: I have just added a working version of my code for others benefit. No memory errors but its very slow.

Ali
  • 41
  • 3
  • Ick, your code is `O(m*n)`. It's trivial to do this in `O(m log m + n log n)`, and sometimes possible in `O(m + n)` – o11c Aug 25 '17 at 17:07
  • What you said about Algo complexity etc is above my head mate. – Ali Aug 29 '17 at 16:09
  • then before you learn *anything* else about programming, learn about algoritmic complexity and big-O notation. This is *important*. – o11c Aug 29 '17 at 17:46
  • Can you point me towards some layman articles for this please? Thx – Ali Aug 30 '17 at 07:44
  • wikipedia is pretty good: https://en.wikipedia.org/wiki/Big_O_notation and https://en.wikipedia.org/wiki/Analysis_of_algorithms - but any decent textbook should cover it too – o11c Aug 30 '17 at 16:18

2 Answers2

0

I'm not sure why it doesn't work but I think I know a way how you can solve it:

def is_subset_of(file1, file2):
    with open(file1, 'r') as f1, open(file2, 'r') as f2:
        for line in f1:
            line = line.strip()
            f2.seek(0)   # go to the start of f2
            if line not in (line2.strip() for line2 in f2):
                return False
    return True

This avoids opening the second file multiple times by always seeking to the start again for each line and at any moment you only hold 2 lines in memory. That should be very memory-friendly.

Another way (that could be faster) would be to sort both file1 and file2. That way you could compare line by line and move to the next line in the other file if the string is lexically smaller than the one from the first file. Instead of O(n**2) that could be performed in O(n*log(n)). However that's much more complicated and I don't know if sorting GB of files makes sense (could use too much memory!).

MSeifert
  • 118,681
  • 27
  • 271
  • 293
  • Sorry I forgot to mention that the mmap.find() does not give me a memory issue. It just doesnt do the matching correctly. – Ali Aug 25 '17 at 17:20
  • MSeifert, Yes your code worked. I gave you an up vote but it doesnt register as I have less than 15 reputation. However your code is much slower than the mmap solution that I posted above. I was basically missing a strip() from the string and thats why it wasnt doing the matching. Thank you very much :) – Ali Aug 29 '17 at 10:10
  • @Ali Ah, so the `strip` was missing, at least something useful in the answer. If you found the solution yourself you can always post an additional answer here. It may be helpful for future readers. – MSeifert Aug 29 '17 at 10:12
  • I just did. Its still very slow compared to some thing like a unix diff. Do you think the search I am doing in mmap is the most efficient way of doing this or is there a better alternative? Thanks. – Ali Aug 29 '17 at 16:13
0

Dealing with files that don't fit in memory is always hard.

If file1 fits in memory but file2 is too large, here is a solution:

# file1 and file2 are open file-like objects
unseen = set(file1)
for line in file2:
    unseen -= {line} # avoid exception from set.remove
#if unseen is empty, all lines were found in file2

Otherwise, you should sort (or perhaps CFBS-sort) at least one of the files.

o11c
  • 13,564
  • 4
  • 46
  • 66