39

I have 2 files called "hosts" (in different directories)

I want to compare them using Python to see if they are identical. If they are not identical, I want to print the differences on the screen.

So far I have tried this

hosts0 = open(dst1 + "/hosts","r") 
hosts1 = open(dst2 + "/hosts","r")

lines1 = hosts0.readlines()

for i,lines2 in enumerate(hosts1):
    if lines2 != lines1[i]:
        print "line ", i, " in hosts1 is different \n"
        print lines2
    else:
        print "same"

But when I run this, I get

File "./audit.py", line 34, in <module>
  if lines2 != lines1[i]:
IndexError: list index out of range

This means one of the hosts files has more lines than the other. Is there a better method to compare 2 files and report the differences?

Casimir Crystal
Matt
  • How about calculating a hash, as a shortcut to quickly find out if they are different? – matcheek Oct 01 '13 at 15:43
  • Use difflib or just the diff command on the console – Andreas Jung Oct 01 '13 at 15:44
  • http://stackoverflow.com/questions/977491/comparing-2-txt-files-using-difflib-in-python – YXD Oct 01 '13 at 15:44
  • @MrE I have already seen that one. It does not answer my question. I am a beginner in python and that question talks about hash and exiting as soon as it notices a difference. I don't want to exit. I want to print out all the difference. (thank you btw) – Matt Oct 01 '13 at 15:51
  • @user2799617 I will look into difflib but the diff command is a linux command. Python doesn't recognize it..!! – Matt Oct 01 '13 at 15:54
  • @matcheek Thank you. But I don't only wanna find out if they are different. I want to read the complete files and report the difference. Therefore, I don't want python to exit as soon as it notices a difference – Matt Oct 01 '13 at 15:55
  • do you want to compare equivalent line numbers? or find out if a line is in another file? – Harpal Oct 01 '13 at 16:07
  • I want to compare equivalent line numbers. In other words, I want to see if 2 files are EXACTLY THE SAME. If they aren't, then I want to print the line(s) where they are different. Thank you – Matt Oct 01 '13 at 16:12
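
The hash shortcut suggested in the comments only answers "are the files identical?"; it does not show where they differ. A minimal sketch of that check, assuming SHA-256 and the hypothetical helper names `file_digest` and `files_identical` (neither appears in the thread):

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b):
    """Return True when both files contain exactly the same bytes."""
    return file_digest(path_a) == file_digest(path_b)
```

With the variables from the question this would be called as `files_identical(dst1 + "/hosts", dst2 + "/hosts")`. The standard library's `filecmp.cmp(path_a, path_b, shallow=False)` performs a similar content comparison without hashing. Either way, a diff is still needed to print the differing lines.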

5 Answers

73
import difflib

lines1 = '''
dog
cat
bird
buffalo
gophers
hound
horse
'''.strip().splitlines()

lines2 = '''
cat
dog
bird
buffalo
gopher
horse
mouse
'''.strip().splitlines()

# Changes:
# swapped positions of cat and dog
# changed gophers to gopher
# removed hound
# added mouse

for line in difflib.unified_diff(lines1, lines2, fromfile='file1', tofile='file2', lineterm=''):
    print(line)

Outputs the following:

--- file1
+++ file2
@@ -1,7 +1,7 @@
+cat
 dog
-cat
 bird
 buffalo
-gophers
-hound
+gopher
 horse
+mouse

This diff gives you context -- surrounding lines to help make it clear how the file is different. You can see "cat" here twice, because it was removed from below "dog" and added above it.

You can use n=0 to remove the context.

for line in difflib.unified_diff(lines1, lines2, fromfile='file1', tofile='file2', lineterm='', n=0):
    print(line)

Outputting this:

--- file1
+++ file2
@@ -0,0 +1 @@
+cat
@@ -2 +2,0 @@
-cat
@@ -5,2 +5 @@
-gophers
-hound
+gopher
@@ -7,0 +7 @@
+mouse

But now it's full of the "@@" lines telling you the position in the file that has changed. Let's remove the extra lines to make it more readable.

for line in difflib.unified_diff(lines1, lines2, fromfile='file1', tofile='file2', lineterm='', n=0):
    # Skip the two file header lines and the "@@" position markers
    for prefix in ('---', '+++', '@@'):
        if line.startswith(prefix):
            break
    else:
        print(line)

Giving us this output:

+cat
-cat
-gophers
-hound
+gopher
+mouse

Now what do you want it to do? If you ignore all removed lines, then you won't see that "hound" was removed. If you're happy just showing the additions to the file, then you could do this:

diff = difflib.unified_diff(lines1, lines2, fromfile='file1', tofile='file2', lineterm='', n=0)
lines = list(diff)[2:]  # drop the '---' and '+++' header lines
added = [line[1:] for line in lines if line[0] == '+']
removed = [line[1:] for line in lines if line[0] == '-']

print('additions:')
for line in added:
    print(line)
print('')
print('additions, ignoring position:')
for line in added:
    # A line that was both added and removed has only moved
    if line not in removed:
        print(line)

Outputting:

additions:
cat
gopher
mouse

additions, ignoring position:
gopher
mouse

You can probably tell by now that there are various ways to "print the differences" of two files, so you will need to be very specific if you want more help.

rbutcher
  • That's exactly it. Also, is there a way to get the line number where the files are different? In the loop "for line in diff", "line" is a line of the diff output. I want the line number in the ORIGINAL files. – Matt Oct 02 '13 at 14:02
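
One way to get line numbers from the original files is to skip the diff text and ask `difflib.SequenceMatcher` for its opcodes, which carry index ranges into both inputs. This is only a sketch: `report_changes` is a hypothetical helper name, and line numbers are reported 1-based.

```python
import difflib


def report_changes(lines_a, lines_b):
    """Return (tag, line numbers in a, line numbers in b) for each differing block.

    Hypothetical helper; line numbers are 1-based. An empty list means the
    block has no lines on that side (a pure insert or a pure delete).
    """
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b)
    changes = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical blocks are not reported
        a_numbers = list(range(a_start + 1, a_end + 1))
        b_numbers = list(range(b_start + 1, b_end + 1))
        changes.append((tag, a_numbers, b_numbers))
    return changes


lines1 = ["dog", "cat", "bird", "buffalo", "gophers", "hound", "horse"]
lines2 = ["cat", "dog", "bird", "buffalo", "gopher", "horse", "mouse"]

for tag, a_numbers, b_numbers in report_changes(lines1, lines2):
    print(tag, "file1 lines:", a_numbers, "file2 lines:", b_numbers)
```

For the example lists used in this answer, it reports an insert at line 1 of file2, a delete at line 2 of file1, a replace of lines 5 and 6 of file1 by line 5 of file2, and an insert at line 7 of file2.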
11

The difflib library is useful for this, and comes in the standard library. I like the unified diff format.

http://docs.python.org/2/library/difflib.html#difflib.unified_diff

import difflib
import sys

with open('/tmp/hosts0', 'r') as hosts0:
    with open('/tmp/hosts1', 'r') as hosts1:
        diff = difflib.unified_diff(
            hosts0.readlines(),
            hosts1.readlines(),
            fromfile='hosts0',
            tofile='hosts1',
        )
        for line in diff:
            sys.stdout.write(line)

Outputs:

--- hosts0
+++ hosts1
@@ -1,5 +1,4 @@
 one
 two
-dogs
 three

And here is a dodgy version that ignores certain lines. There might be edge cases that don't work, and there are surely better ways to do this, but maybe it will be good enough for your purposes.

import difflib
import sys

with open('/tmp/hosts0', 'r') as hosts0:
    with open('/tmp/hosts1', 'r') as hosts1:
        diff = difflib.unified_diff(
            hosts0.readlines(),
            hosts1.readlines(),
            fromfile='hosts0',
            tofile='hosts1',
            n=0,
        )
        for line in diff:
            for prefix in ('---', '+++', '@@'):
                if line.startswith(prefix):
                    break
            else:
                sys.stdout.write(line[1:])
rbutcher
  • Thank you. This is so close to what I want. But is there a way to ONLY display dogs and nothing else? – Matt Oct 01 '13 at 16:26
  • Perfect. Rated the best answer. Just one last thing: right now the newest method you just posted will print the lines in BOTH files. For example, it'll print "Dogs Dosg". Is there a way for it to ONLY print one of them, and not both? – Matt Oct 01 '13 at 16:56
  • I think I shouldn't have made it cut the first character off the line when printing. Try removing the `[1:]` from it. If something is appearing twice, it's probably because the diff considered it to be moved - removed from one place and added to another. Perhaps you could post your input files and expected output, because I'm not quite sure what you're trying to achieve. Maybe you are more concerned about unique lines and not their positions within the files? – rbutcher Oct 01 '13 at 17:21
  • Alright. For example, in one of my files I have "Cats", in the other "Cast". When I run the code it does exactly what I want, except the code prints "Cats Cast". I want it to ONLY print one of them. – Matt Oct 01 '13 at 18:57
  • My response wouldn't fit in a comment so I've added a new answer. – rbutcher Oct 02 '13 at 00:15
3
hosts0 = open("C:path\\a.txt","r")
hosts1 = open("C:path\\b.txt","r")

lines1 = hosts0.readlines()

for i,lines2 in enumerate(hosts1):
    if lines2 != lines1[i]:
        print "line ", i, " in hosts1 is different \n"
        print lines2
    else:
        print "same"

The above code is working for me. Can you please indicate what error you are facing?

Phillip
Raj
  • Thank you for answering, but I'm running into this error: `File "./audit.py", line 34, in <module>` `if lines2 != lines1[i]:` `IndexError: list index out of range`, which means one of my files has more lines than the other. – Matt Oct 01 '13 at 16:03
1

You can add a conditional statement. If the index goes beyond the end of the shorter list, break out of the loop and print the rest of the longer file.
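
A related way to avoid the `IndexError` is to pair the lines up with `itertools.zip_longest`, which pads the shorter file with `None` instead of running out of indexes. This is a rough sketch, not the answerer's code: `compare_lines` is a hypothetical helper name, and it is written for Python 3 (in Python 2 the function is called `itertools.izip_longest`).

```python
from itertools import zip_longest


def compare_lines(lines_a, lines_b):
    """Return (line number, line from a, line from b) for every position that differs.

    Hypothetical helper; line numbers are 1-based and a missing line is None.
    """
    differences = []
    paired = zip_longest(lines_a, lines_b)  # pads the shorter input with None
    for number, (line_a, line_b) in enumerate(paired, start=1):
        if line_a != line_b:
            differences.append((number, line_a, line_b))
    return differences


# Small in-memory example; with real files, pass the results of readlines()
first = ["one\n", "two\n"]
second = ["one\n", "three\n", "four\n"]
for number, line_a, line_b in compare_lines(first, second):
    print("line", number, "differs:", repr(line_a), "vs", repr(line_b))
```

This compares equivalent line numbers only, so one inserted line near the top makes every later line count as different. The difflib answers handle that case better.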

1
import difflib

# Read both files; the with block closes them automatically
with open('a.txt', 'r') as f, open('b.txt', 'r') as f1:
    str1 = f.read().split()   # split() with no argument splits on any whitespace
    str2 = f1.read().split()

# Compare the two word lists (b.txt first, then a.txt) and print the result
d = difflib.Differ()
diff = list(d.compare(str2, str1))
print('\n'.join(diff))
Azad Mehla
  • A simple solution: just open the two files, split them into words, and compare them with the Differ class. – Azad Mehla Sep 17 '15 at 11:18
  • Welcome to Stack Overflow! Please consider editing your post to add more explanation about what your code does and why it will solve the problem. An answer that mostly just contains code (even if it's working) usually won't help the OP to understand their problem. – SuperBiasedMan Sep 17 '15 at 13:40
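
Since the question is about lines rather than words, the same Differ machinery can be applied to whole lines through `difflib.ndiff`. The sketch below assumes the lines are already in memory and uses a hypothetical helper name, `changed_lines`. It drops the unchanged lines, which `ndiff` prefixes with two spaces.

```python
import difflib


def changed_lines(lines_a, lines_b):
    """Return the ndiff output for two line lists without the unchanged lines.

    Hypothetical helper: '- ' marks a line only in lines_a, '+ ' a line only
    in lines_b, and '? ' lines are hints pointing at changed characters.
    """
    return [
        line
        for line in difflib.ndiff(lines_a, lines_b)
        if not line.startswith("  ")
    ]


for line in changed_lines(["one", "Cats", "three"], ["one", "Cast", "three"]):
    print(line.rstrip("\n"))
```

To show only one side of a changed pair, as asked in the comments on another answer, filter further on `line.startswith("+ ")` or `line.startswith("- ")`.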