I have two unsorted text files (between 150MB and 1GB in size).

I want to find all the lines that are present in a.txt and not in b.txt.

a.txt contains:

qwe
asd
zxc
rty

b.txt contains:

qwe
zxc

If I combine a.txt and b.txt into c.txt I get:

qwe
asd
zxc
rty
qwe
zxc

I sort them alphabetically and get:

asd
qwe
qwe
rty
zxc
zxc

Then I use regex mode to search for (.*)\n(\1)\n and replace all matches with nothing, and then I replace \n\n with \n several times, to get the "difference" between the two files.

Now I am unable to do this in Python. I can get as far as the sorting part, but the regular expression doesn't seem to work across multiple lines. Here is my Python code:

f = open("output.txt", 'w')
s = open(outputfile,'r+')
for line in s.readlines():
    s = line.replace('(.*)\n(\1)\n', '')
    f.write(s)

f.close() 

1 Answer


I can get as far as the sorting part, but the regular expression doesn't seem to work across multiple lines.

Your regex is fine. You don't have multiple lines; you have single lines:

for line in s.readlines():

file.readlines() reads the whole file into memory as a list of lines. You then iterate over each of those single lines, so line will be 'asd\n' or 'qwe\n', and never 'qwe\nqwe\n'.
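If you really did want the regex route, reading the whole file with file.read() gives you a single multi-line string that the pattern can match against. A minimal sketch, assuming the merged, sorted file fits in memory:

import re

with open('c.txt', 'r') as merged:
    text = merged.read()  # one multi-line string, not a list of lines

# remove each adjacent duplicate pair, as the Notepad++ replace did
deduped = re.sub(r'(.*)\n\1\n', '', text)

with open('output.txt', 'w') as outfile:
    outfile.write(deduped)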

Given that you are reading all of your merged file into memory, I'm going to presume that your files are not that big. In that case, it'd be much easier to just read one of those files into a set object, then just test each line of the other file to find the differences:

with open('a.txt', 'r') as file_a:
    lines = set(file_a)  # all lines, as a set, with newlines

new_in_b = []
with open('b.txt', 'r') as file_b:
    for line in file_b:
        if line in lines:
            # present in both files, remove from `lines` to find extra lines in a
            lines.remove(line)
        else:
            # extra line in b
            new_in_b.append(line)

print('Lines in a missing from b')
for line in sorted(lines):
    print(line.rstrip())  # remove the newline when printing.
print()

print('Lines in b missing from a')
for line in new_in_b:
    print(line.rstrip())  # remove the newline when printing.
print()

If you wanted to write those all out to a file, you could just combine the two sequences and write out the sorted list:

with open('c.txt', 'w') as file_c:
    file_c.writelines(sorted(list(lines) + new_in_b))

Your approach, sorting your lines first, putting them all in a file, and then matching paired lines, is possible too. All you need to do is remember the preceding line. Together with the current line, that's a pair. Note that you don't need a regular expression for this, just an equality test:

with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
    preceding = None
    for line in file_c:
        if preceding and preceding == line:
            # skip writing this line, but clear 'preceding' so we don't
            # check the next line against it
            preceding = None
        else:
            if preceding is not None:
                outfile.write(preceding)
            preceding = line
    # write out the last line
    if preceding:
        outfile.write(preceding)

Note that this never reads the whole file into memory! Iterating directly over the file gives you individual lines; the file itself is read into a buffer in chunks behind the scenes. This is a very efficient way of processing lines.

You can also iterate over the file two lines at a time, using itertools.tee() to duplicate the file object iterator:

from itertools import tee

with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
    iter1, iter2 = tee(file_c)  # two iterators with a shared source
    line2 = next(iter2, None)   # move the second iterator ahead a line
    skip = False
    # iterate over each line together with the line that follows it
    for line1, line2 in zip(iter1, iter2):
        if skip:
            # line1 was the second half of a matched pair; don't write it
            skip = False
        elif line1 == line2:
            # matched pair; write neither line
            skip = True
        else:
            outfile.write(line1)
    # the last line is only ever seen as line2; write it out unless
    # it was part of a matched pair
    if line2 is not None and not skip:
        outfile.write(line2)

A third approach is to use itertools.groupby() to group lines that are equal together. You can then decide what to do with those groups:

from itertools import groupby

with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
    for line, group in groupby(file_c):
        # group is an iterator of all the lines in c that are equal;
        # the same value is already in line, so all we need to do is
        # *count* how many such lines there are:
        count = sum(1 for _ in group)  # count without shadowing line
        if count == 1:
            # line is unique, write it out
            outfile.write(line)

I'm assuming that it doesn't matter if there are 2 or more copies of the same line. In other words, you don't want pairing; you only want to find the unique lines (those present only in a or only in b).
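If you did want pairing semantics instead, so that an odd number of copies leaves exactly one unpaired copy in the output, a minimal sketch of that variant of the groupby() approach:

from itertools import groupby

with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
    for line, group in groupby(file_c):
        count = sum(1 for _ in group)
        if count % 2:
            # an odd count means one copy is left without a pair
            outfile.write(line)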

If your files are extremely large but already sorted, you can use a merge sort approach, without having to merge your two files into one manually. The heapq.merge() function gives you lines from multiple files in sorted order provided the inputs are sorted individually. Use this together with groupby():

import heapq
from itertools import groupby

# files a.txt and b.txt are assumed to be sorted already
with open('a.txt', 'r') as file_a, open('b.txt', 'r') as file_b,\
        open('output.txt', 'w') as outfile:
    for line, group in groupby(heapq.merge(file_a, file_b)):
        count = sum(1 for _ in group)
        if count == 1:
            outfile.write(line)

Again, these approaches only read enough data from each file to fill a buffer. The heapq.merge() iterator only holds two lines in memory at a time, as does groupby(). This lets you process files of any size, regardless of your memory constraints.
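If the inputs are not yet sorted, you can still keep memory use bounded by sorting fixed-size chunks of each file, spilling each sorted chunk to a temporary file, and letting heapq.merge() combine the chunks. A minimal sketch, assuming chunks of a million lines fit in memory (sorted_chunks and lines_per_chunk are names invented here):

import heapq
import tempfile
from itertools import groupby, islice

def sorted_chunks(path, lines_per_chunk=1_000_000):
    # read the file a block of lines at a time, sort each block in
    # memory, and spill it to a temporary file; return the chunk files
    chunks = []
    with open(path, 'r') as source:
        while True:
            block = list(islice(source, lines_per_chunk))
            if not block:
                break
            block.sort()
            chunk = tempfile.TemporaryFile('w+')
            chunk.writelines(block)
            chunk.seek(0)
            chunks.append(chunk)
    return chunks

chunks = sorted_chunks('a.txt') + sorted_chunks('b.txt')
with open('output.txt', 'w') as outfile:
    # merge all the sorted chunks and keep only the unique lines
    for line, group in groupby(heapq.merge(*chunks)):
        if sum(1 for _ in group) == 1:
            outfile.write(line)
for chunk in chunks:
    chunk.close()

Each chunk file is read back lazily, so peak memory use stays around lines_per_chunk lines plus one buffered line per chunk.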

Martijn Pieters
  • 889,049
  • 245
  • 3,507
  • 2,997
  • My files are in the range of 150 MB to 1 GB, each containing image file names, one per line, like 201805002113_P.jpg – Sanjay Wadhwa Jul 05 '18 at 13:23
  • @SanjayWadhwa: that's up to 50 million lines then; that does become a little... large. Are your input files sorted at all? – Martijn Pieters Jul 05 '18 at 13:32
  • @SanjayWadhwa: your own approach can't handle that very well either, because you are reading *all lines of `c.txt` into memory*. With up to 100 million lines (two 1GB files of filenames) you *will* run out of memory. – Martijn Pieters Jul 05 '18 at 13:33
  • I know my approach will not work... And no... they aren't sorted at all – Sanjay Wadhwa Jul 05 '18 at 13:41
  • @SanjayWadhwa: I'd get them sorted then; there are efficient command-line tools that can get you a sorted file output. Even if you only sort *subsets of each file*, you can use `heapq.merge()` to sort together any number of subsets if each subset is sorted. – Martijn Pieters Jul 05 '18 at 13:57