I am able to do it up to the sorting part, but regular expressions don't seem to be working across multiple lines.
Your regex is fine. You don't have multi-lines. You have single lines:
for line in s.readlines():
file.readlines() reads all of a file into memory as a list of lines. You then iterate over each of those single lines, so line will be 'asd\n' or 'qwe\n', and never 'qwe\nqwe\n'.
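As an aside: if you really did want a regex to match across lines, you would need the whole text in a single string (file.read() rather than file.readlines()) plus the re.MULTILINE flag. A minimal sketch, with the file contents inlined as a string for illustration:

```python
import re

# whole file read with file.read(), not readlines() (inlined here)
text = 'qwe\nqwe\nasd\n'

# with re.MULTILINE, ^ and $ match at every line boundary, so this
# finds any line immediately followed by an identical line:
pairs = re.findall(r'^(.+)\n(?=\1$)', text, re.MULTILINE)
print(pairs)  # ['qwe']
```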
Given that you are reading all of your merged file into memory, I'm going to presume that your files are not that big. In that case, it'd be much easier to just read one of those files into a set object, then just test each line of the other file to find the differences:
    with open('a.txt', 'r') as file_a:
        lines = set(file_a)  # all lines, as a set, with newlines

    new_in_b = []
    with open('b.txt', 'r') as file_b:
        for line in file_b:
            if line in lines:
                # present in both files; remove from `lines` to find extra lines in a
                lines.remove(line)
            else:
                # extra line in b
                new_in_b.append(line)

    print('Lines in a missing from b')
    for line in sorted(lines):
        print(line.rstrip())  # remove the newline when printing
    print()

    print('Lines in b missing from a')
    for line in new_in_b:
        print(line.rstrip())  # remove the newline when printing
    print()
If you wanted to write those all out to a file, you could just combine the two sequences and write out the sorted list:
    with open('c.txt', 'w') as file_c:
        file_c.writelines(sorted(list(lines) + new_in_b))
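Side note: if you don't need the two directions separately, the set symmetric-difference operator (^) collapses the whole thing into one expression. Sketched here with io.StringIO standing in for the two files; be aware that duplicate lines within a single file collapse inside a set:

```python
import io

# stand-ins for open('a.txt') and open('b.txt')
file_a = io.StringIO('asd\nqwe\nzxc\n')
file_b = io.StringIO('qwe\nrty\nzxc\n')

# symmetric difference: lines present in exactly one of the two files
unique = set(file_a) ^ set(file_b)
print(sorted(unique))  # ['asd\n', 'rty\n']
```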
Your approach, sorting your lines first, putting them all in a file, and then matching paired lines, is possible too. All you need to do is remember the preceding line. Together with the current line, that's a pair. Note that you don't need a regular expression for this, just an equality test:
    with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
        preceding = None
        for line in file_c:
            if preceding and preceding == line:
                # skip writing this line, but clear 'preceding' so we don't
                # check the next line against it
                preceding = None
            else:
                if preceding:
                    outfile.write(preceding)
                preceding = line
        # write out the last line
        if preceding:
            outfile.write(preceding)
Note that this never reads the whole file into memory! Iteration directly over the file gives you individual lines, where the file is read in chunks into a buffer. This is a very efficient method of processing lines.
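To see that pairing logic in action without touching the filesystem, here is the same preceding-line technique run over a small sorted sample, with io.StringIO standing in for the merged file:

```python
import io

file_c = io.StringIO('asd\nqwe\nqwe\nrty\n')  # sorted, merged sample

out = []
preceding = None
for line in file_c:
    if preceding and preceding == line:
        preceding = None  # matched pair: drop both lines
    else:
        if preceding:
            out.append(preceding)
        preceding = line
if preceding:
    out.append(preceding)  # the last line never found a partner

print(out)  # ['asd\n', 'rty\n']
```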
You can also iterate over the file two lines at a time, using itertools.tee() to tee off the file object iterator:
    from itertools import tee

    with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
        iter1, iter2 = tee(file_c)  # two iterators with a shared source
        line2 = next(iter2, None)   # move the second iterator ahead a line
        skip = False
        for line1, line2 in zip(iter1, iter2):
            if skip:
                # line1 is the second half of a matched pair; drop it too
                skip = False
            elif line1 == line2:
                # matched pair; don't write line1, and skip line2 next time
                skip = True
            else:
                outfile.write(line1)
        # write out the last line if it wasn't part of a matched pair
        if line2 and not skip:
            outfile.write(line2)
A third approach is to use itertools.groupby() to group equal consecutive lines together. You can then decide what to do with those groups:
    from itertools import groupby

    with open('c.txt', 'r') as file_c, open('output.txt', 'w') as outfile:
        for line, group in groupby(file_c):
            # group is an iterator of all the consecutive lines in c that
            # are equal; the value is already in line, so all we need to
            # do is *count* how many such lines there are:
            count = sum(1 for _ in group)
            if count == 1:
                # line is unique, write it out
                outfile.write(line)
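The grouping is easiest to see on a small in-memory sample; groupby() bundles runs of equal lines, and counting each run picks out the unique ones:

```python
from itertools import groupby

lines = ['asd\n', 'qwe\n', 'qwe\n', 'rty\n']  # sorted sample
unique = [line for line, group in groupby(lines)
          if sum(1 for _ in group) == 1]
print(unique)  # ['asd\n', 'rty\n']
```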
I'm assuming that it doesn't matter if there are 2 or more copies of the same line. In other words, you don't want pairing; you only want to find the unique lines (those present in just a or just b).
If your files are extremely large but already sorted, you can use a merge-sort approach without having to merge your two files into one manually. The heapq.merge() function gives you lines from multiple files in sorted order, provided the inputs are individually sorted. Use this together with groupby():
    import heapq
    from itertools import groupby

    # files a.txt and b.txt are assumed to be sorted already
    with open('a.txt', 'r') as file_a, open('b.txt', 'r') as file_b, \
            open('output.txt', 'w') as outfile:
        for line, group in groupby(heapq.merge(file_a, file_b)):
            count = sum(1 for _ in group)
            if count == 1:
                outfile.write(line)
Again, these approaches only read enough data from each file to fill a buffer. The heapq.merge() iterator only holds two lines in memory at a time (one per input file), as does groupby(). This lets you process files of any size, regardless of your memory constraints.
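The same merge-and-group combination, demonstrated on two small sorted lists in place of the files:

```python
import heapq
from itertools import groupby

lines_a = ['asd\n', 'qwe\n']  # sorted, as if read from a.txt
lines_b = ['qwe\n', 'rty\n']  # sorted, as if read from b.txt

# heapq.merge lazily yields 'asd', 'qwe', 'qwe', 'rty' in order;
# groupby then bundles the matching pair so we can drop it
merged = heapq.merge(lines_a, lines_b)
unique = [line for line, group in groupby(merged)
          if sum(1 for _ in group) == 1]
print(unique)  # ['asd\n', 'rty\n']
```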