
I am not an expert in Python. I searched as best I could but couldn't find an answer. Pardon me if this is a duplicate question, and please point me in the right direction if you can.

I am trying to compare two CSV files using Python's difflib and generate the diff output as an HTML page. The command-line example in the difflib documentation has a built-in -m option that generates HTML output showing the two CSV files side by side with the differences highlighted.

difflib uses difflib.SequenceMatcher to find the differences and difflib.HtmlDiff.make_file to create the HTML file. However, the output it produces is not what I want.

The output I currently get from difflib is the default difflib HTML output, included further down.

The output I want is a word-level highlight, instead of changes highlighted at the character level or as a sequence. I want to highlight the WHOLE WORD if anything in it changed between the old file and the new file.

In short, the change I want is a word-level highlight of the text.

Please help me with this: is this really possible with difflib, or do I have to use other tools or modules? I tried vimdiff and other plugins, but got nowhere. I am open to anything here.

The code I am using is from the Python difflib documentation page.

import sys, os, time, difflib, optparse

def main():
    # ..
    # ..
    # ..
    n = options.lines  # I set n to zero.
    fromfile, tofile = args  # as specified in the usage string

    # we're passing these as arguments to the diff function
    fromdate = time.ctime(os.stat(fromfile).st_mtime)
    todate = time.ctime(os.stat(tofile).st_mtime)
    with open(fromfile) as f:
        fromlines = f.readlines()
    with open(tofile) as f:
        tolines = f.readlines()

    diff = difflib.HtmlDiff().make_file(fromlines, tolines, fromfile,
                                        tofile, context=True,
                                        numlines=0)

    # make_file returns a single string, so write it directly
    sys.stdout.write(diff)

Old.csv

refno,title,author,year,price
1001,CPP,MILTON,2008,456
1002,JAVA,Gilson,2002,456
1003,Adobe Flex,2010,566
1004,General Knowledge,Sinson,2007,465
1005,Actionscript,Gilto,2008,480

new.csv

refno,title,author,year,price
1001,CPP,MILTON,2010,456,2008
1002,JAVA,Gilson,2002
1003,Adobe Flexi,Johnson,2010,566
1004,General Knowledge,Simpson,2007,465
105,Action script,Gilto,2008,480
2000,Drama,DayoNe,,2020,560

I am also adding the Default HTML DIFF Output and Expected HTML DIFF output below.

Default HTML DIFF Output from DIFFLIB:

<html>

<head>
<meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1" />
<title></title>
<style type="text/css">
table.diff {font-family:Courier; border:medium;}
.diff_header {background-color:#e0e0e0}
td.diff_header {text-align:right}
.diff_next {background-color:#c0c0c0}
.diff_add {background-color:#aaffaa}
.diff_chg {background-color:#ffff77}
.diff_sub {background-color:#ffaaaa}
</style>
</head>

<body>

<table class="diff" id="difflib_chg_to0__top"
cellspacing="0" cellpadding="0" rules="groups" >
<colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
<colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
<thead><tr><th class="diff_next"><br /></th><th colspan="2" class="diff_header">old.csv</th><th class="diff_next"><br /></th><th colspan="2" class="diff_header">new.csv</th></tr></thead>
<tbody>
<tr><td class="diff_next" id="difflib_chg_to0__0"><a href="#difflib_chg_to0__top">t</a></td><td class="diff_header" id="from0_2">2</td><td nowrap="nowrap">1001,CPP,MILTON,200<span class="diff_sub">8</span>,456</td><td class="diff_next"><a href="#difflib_chg_to0__top">t</a></td><td class="diff_header" id="to0_2">2</td><td nowrap="nowrap">1001,CPP,MILTON,20<span class="diff_add">1</span>0,456<span class="diff_add">,2008</span></td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_3">3</td><td nowrap="nowrap">1002,JAVA,Gilson,2002<span class="diff_sub">,456</span></td><td class="diff_next"></td><td class="diff_header" id="to0_3">3</td><td nowrap="nowrap">1002,JAVA,Gilson,2002</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_4">4</td><td nowrap="nowrap">1003,Adobe&nbsp;Flex,2010,566</td><td class="diff_next"></td><td class="diff_header" id="to0_4">4</td><td nowrap="nowrap">1003,Adobe&nbsp;Flex<span class="diff_add">i,Johnson</span>,2010,566</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_5">5</td><td nowrap="nowrap">1004,General&nbsp;Knowledge,Si<span class="diff_chg">n</span>son,2007,465</td><td class="diff_next"></td><td class="diff_header" id="to0_5">5</td><td nowrap="nowrap">1004,General&nbsp;Knowledge,Si<span class="diff_chg">mp</span>son,2007,465</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_6">6</td><td nowrap="nowrap">1<span class="diff_sub">0</span>05,Actionscript,Gilto,2008,480</td><td class="diff_next"></td><td class="diff_header" id="to0_6">6</td><td nowrap="nowrap">105,Action<span class="diff_add">&nbsp;</span>script,Gilto,2008,480</td></tr>
<tr><td class="diff_next"></td><td class="diff_header"></td><td nowrap="nowrap"></td><td class="diff_next"></td><td class="diff_header" id="to0_7">7</td><td nowrap="nowrap"><span class="diff_add">2000,Drama,DayoNe,,2020,560</span></td></tr>
</tbody>
</table>

</body>

</html>

Expected HTML DIFF Output from DIFFLIB:

<html>

<head>
<meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1" />
<title></title>
<style type="text/css">
table.diff {font-family:Courier; border:medium;}
.diff_header {background-color:#e0e0e0}
td.diff_header {text-align:right}
.diff_next {background-color:#c0c0c0}
.diff_add {background-color:#aaffaa}
.diff_chg {background-color:#ffff77}
.diff_sub {background-color:#ffaaaa}
</style>
</head>

<body>

<table class="diff" id="difflib_chg_to0__top"
cellspacing="0" cellpadding="0" rules="groups" >
<colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
<colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
<thead><tr><th class="diff_next"><br /></th><th colspan="2" class="diff_header">old.csv</th><th class="diff_next"><br /></th><th colspan="2" class="diff_header">new.csv</th></tr></thead>
<tbody>
<tr><td class="diff_next" id="difflib_chg_to0__0"><a href="#difflib_chg_to0__top">t</a></td><td class="diff_header" id="from0_2">2</td><td nowrap="nowrap">1001,CPP,MILTON,<span class="diff_sub">2008</span>,456</td><td class="diff_next"><a href="#difflib_chg_to0__top">t</a></td><td class="diff_header" id="to0_2">2</td><td nowrap="nowrap">1001,CPP,MILTON,<span class="diff_add">2010</span>,456<span class="diff_add">,2008</span></td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_3">3</td><td nowrap="nowrap">1002,JAVA,Gilson,2002<span class="diff_sub">,456</span></td><td class="diff_next"></td><td class="diff_header" id="to0_3">3</td><td nowrap="nowrap">1002,JAVA,Gilson,2002</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_4">4</td><td nowrap="nowrap">1003,<span class="diff_sub">Adobe&nbsp;Flex</span>,2010,566</td><td class="diff_next"></td><td class="diff_header" id="to0_4">4</td><td nowrap="nowrap">1003,<span class="diff_add">Adobe&nbsp;Flexi</span>,<span class="diff_add">Johnson</span>,2010,566</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_5">5</td><td nowrap="nowrap">1004,General&nbsp;Knowledge,<span class="diff_sub">Sinson</span>,2007,465</td><td class="diff_next"></td><td class="diff_header" id="to0_5">5</td><td nowrap="nowrap">1004,General&nbsp;Knowledge,<span class="diff_add">Simpson</span>,2007,465</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_6">6</td><td nowrap="nowrap"><span class="diff_sub">1005</span>,<span class="diff_sub">Actionscript</span>,Gilto,2008,480</td><td class="diff_next"></td><td class="diff_header" id="to0_6">6</td><td nowrap="nowrap"><span class="diff_add">105</span>,<span class="diff_add">Action&nbsp;script</span>,Gilto,2008,480</td></tr>
<tr><td class="diff_next"></td><td class="diff_header"></td><td nowrap="nowrap"></td><td class="diff_next"></td><td class="diff_header" id="to0_7">7</td><td nowrap="nowrap"><span class="diff_add">2000,Drama,DayoNe,,2020,560</span></td></tr>
</tbody>
</table>

</body>

</html>
John
  • What I would try to do is find the tagged segments and extend the tagging out to word boundaries. – Terry Jan Reedy Jun 01 '17 at 22:00
  • Alternative is to copy difflib code and modify it to tag words. – Terry Jan Reedy Jun 01 '17 at 22:01
  • I tried to do that. However, I am not a great python expert, I looked inside the difflib, it uses the SequenceMatcher and OpCodes for tagging. I couldn't find a way to tag it for the words. Could you please tell me where to look for this. – John Jun 01 '17 at 22:06
  • Relevant: https://stackoverflow.com/questions/7661045/how-to-highlight-more-than-two-characters-per-line-in-difflibs-html-output?rq=1 – stovfl Jun 02 '17 at 19:28
  • It is Relevant but it doesn't solve my problem. I guess no one is interested in this difflib. – John Jun 02 '17 at 20:23

1 Answer


Question: I am looking for a word level highlight

The class Comma_HtmlDiff below expands each highlight to the surrounding comma boundaries. To do this, it overrides difflib.ndiff.

Note: Only the first highlighted part of a line is expanded.
If difflib.ndiff highlights across a comma, this is not corrected.

class Comma_HtmlDiff(difflib.HtmlDiff):
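    def __init__(self, tabsize=8, wrapcolumn=None, linejunk=None,
                 charjunk=difflib.IS_CHARACTER_JUNK):
        # Keep a reference to the original ndiff only once, so that creating
        # a second instance does not make the patched version call itself.
        if not hasattr(difflib, '_ndiff'):
            setattr(difflib, '_ndiff', difflib.ndiff)
        # HtmlDiff looks up ndiff as a module global, so patch the module.
        setattr(difflib, 'ndiff', self.ndiff)
        super().__init__(tabsize, wrapcolumn, linejunk, charjunk)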

    def ndiff(self, a, b, linejunk=None, charjunk=difflib.IS_CHARACTER_JUNK):
        _line = ''
        for line in difflib._ndiff(a, b, linejunk, charjunk):
            if line.startswith('-'):
                _d = '-'
                _line = line
            elif line.startswith('+'):
                _d = '+'
                _line = line

            if line.startswith('?'):
                dp = line.find(_d)
                if dp == -1:
                    _d = '+'
                    dp = line.find('^')
                dpl = _line.rfind(',', 0, dp)
                if dpl == -1:
                    dpl = 2
                else:
                    dpl += 1
                dpr = _line.find(',', dp)
                if dpr == dp:
                    _d = ' '
                    dpl = dp
                    dpr = dp+1

                dpw = dpr - dpl
                line = line[:dpl] + _d*dpw + line[dpr:]

            yield line

# Usage
diff = Comma_HtmlDiff().make_file(fromlines, tolines, fromfile,
                                    tofile, context=True,
                                    numlines=0)


Tested with Python: 3.4.2
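
For context on what the override above rewrites: difflib.ndiff emits '- ' and '+ ' lines and, when two lines are similar enough, a '? ' guide line whose '-', '+' and '^' characters mark the changed columns. A minimal sketch using one row pair from the question (the variable names are only illustrative):

```python
import difflib

# One changed row from Old.csv / new.csv in the question.
old = ["1004,General Knowledge,Sinson,2007,465\n"]
new = ["1004,General Knowledge,Simpson,2007,465\n"]

# Collect the ndiff output; lines starting with '? ' are the guide lines
# that HtmlDiff turns into character-level highlights.
delta = list(difflib.ndiff(old, new))
for line in delta:
    print(repr(line))
```

Comma_HtmlDiff widens the marker run in each '? ' line out to the nearest commas, which is why the whole field ends up highlighted.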

stovfl
  • Thanks for the code and for replying. If I follow this method, the issue, as you said, is that the whole line is split into individual words and each word forms a new line. However, I don't want the line to be divided into multiple lines. I would like to keep the whole line intact and highlight only the words that differ. Is that possible? – John Jun 05 '17 at 15:22
  • Thank you so much sir. It really helped me in solving my issue. I will try to implement the comma part as you mentioned in your note. I tried to Vote but because of my low reputation it went to moderator!!! – John Jun 05 '17 at 22:48
  • I would like to ask another question (I hope I am not repeating myself here, as no one has answered my question except you): is it possible to do a word-level diff using difflib, like we do with Unix diff? I know difflib uses a `SequenceMatcher`, but its output is awkward and does not suit regular report files. – John Jun 05 '17 at 22:51
  • @John: I checked `man diff (GNU diffutils) 3.3` but didn't see anything regarding word level. Edit your question and show an example command line that does so. I can't imagine how `SequenceMatcher` is relevant to word-level output. – stovfl Jun 06 '17 at 13:42
  • The main idea of using difflib is that I am trying to produce the same output as the `DiffChar` plugin for `vimdiff` in Unix (https://github.com/rickhowe/diffchar.vim). But the result from difflib is completely different. I know it uses a different algorithm to detect the changes; however, that was my whole idea. I will edit the question and try to post the inputs that are needed for this. – John Jun 06 '17 at 18:07
  • @John: If you can compile `DiffChar` as a `C DLL` you can use it in Python through `ctypes` module. – stovfl Jun 06 '17 at 18:51
  • Thanks for the answer. I think for the time being I am fine with the code that you provided. I am looking into the Unix diff source code, as it was implemented from `www.xmailserver.org/diff2.pdf`, and I found out that in `Python` it was implemented by Neil as `google-diff-match-patch` (`https://code.google.com/archive/p/google-diff-match-patch/`). I am currently looking into that and I will post a new question. Is there any way I can reach you about further issues? – John Jun 07 '17 at 17:57
  • Could you please take a look at this question, If you have some time. https://stackoverflow.com/questions/44424082/compare-word-level-differences-of-two-files-in-python – John Jun 07 '17 at 23:24
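
On the question raised in the comments of a word-level diff with difflib itself: SequenceMatcher compares any sequences of hashable items, not only strings, so one option is to split each row into fields and diff the field lists. A rough sketch, assuming the comma is the only delimiter (no quoted fields). Note that highlight_fields is a hypothetical helper, not part of difflib, and it simply reuses the diff_sub and diff_add CSS class names from the HTML in the question:

```python
import difflib
from html import escape


def highlight_fields(old_line, new_line, sep=","):
    """Return (old_html, new_html) with every changed field wrapped whole.

    Hypothetical helper for illustration; assumes plain comma-separated
    fields without quoting.
    """
    old_fields = old_line.rstrip("\n").split(sep)
    new_fields = new_line.rstrip("\n").split(sep)

    # Compare lists of fields instead of strings of characters.
    matcher = difflib.SequenceMatcher(None, old_fields, new_fields,
                                      autojunk=False)
    old_out, new_out = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            old_out.extend(escape(f) for f in old_fields[i1:i2])
            new_out.extend(escape(f) for f in new_fields[j1:j2])
        else:
            # 'replace', 'delete' and 'insert' all mark whole fields.
            old_out.extend('<span class="diff_sub">%s</span>' % escape(f)
                           for f in old_fields[i1:i2])
            new_out.extend('<span class="diff_add">%s</span>' % escape(f)
                           for f in new_fields[j1:j2])
    return sep.join(old_out), sep.join(new_out)


old_html, new_html = highlight_fields(
    "1004,General Knowledge,Sinson,2007,465",
    "1004,General Knowledge,Simpson,2007,465",
)
print(old_html)
print(new_html)
```

This only handles one pair of lines; pairing up the lines of the two files and building the surrounding table are left out.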