I am trying to compare two Arabic strings using Python's difflib.HtmlDiff class. I have looked at various ways of writing the output of HtmlDiff to a file, but none of them works for me. Methods I have tried so far:

Note: in all subsequent code snippets, original and mockinputs are lists of Unicode strings (specifically Arabic text), as HtmlDiff requires.

Method 1
import difflib

hdiff = difflib.HtmlDiff()
html = hdiff.make_file(original, mockinputs)

with open('out_file.html', 'w', encoding='utf-8') as out_file:
    out_file.write(html)

This runs without error, but the HTML file it creates shows gibberish (things like الرحÙ) when opened in a browser.

Method 2 (as pointed out here)
import difflib

htmldiff = difflib.HtmlDiff()
html = htmldiff.make_file(original, mockinputs)

out_file = open('out_file.html', 'w')
out_file.write(html.encode('utf-8'))
out_file.close()

This gives me this error:

TypeError: must be str, not bytes

So, how can I write Unicode text produced by HtmlDiff as shown here to an HTML file in Python 3?

I am using Python 3.4.3.

Sнаđошƒаӽ
    This is a guess, but the documentation for `difflib` says that the `make_file` method was changed in Python 3.5 to have a default charset of "utf-8": https://docs.python.org/3/library/difflib.html#difflib.HtmlDiff.make_file Did you try this in Python 3.5? If I had to bet, I would look at that `make_file` method as the culprit. Also, your first two attempts are doing the same thing, and can you get `difflib` to work without creating an HTML table? – erewok Dec 30 '15 at 17:56
    Thanks for pointing to that new change in 3.5. I think that's the ticket :-) But let's see what else turns up! – Sнаđошƒаӽ Dec 30 '15 at 18:00
    The documentation I linked above says that `make_file` used to have a default charset of `ISO-8859-1`, which would not include Arabic. Further, most browsers are going to see `ISO-8859-1` and fall back to ASCII (at least they used to). Thus, you have to use Python 3.5 or generate the output yourself. (Never mind, you found it...) – erewok Dec 30 '15 at 18:01
    With the first two methods, the HTML file might be valid UTF-8, but the browser may assume it is ISO-8859-1. You can check whether inserting a meta charset tag (for example `<meta charset="utf-8">`) in the HTML header fixes the issue. Method 3 probably assumes you're using Python 2. – roeland Dec 30 '15 at 22:54
    @roeland I mentioned in the title that I am using python 3, and meta charset thing is already there. – Sнаđошƒаӽ Dec 31 '15 at 11:35

1 Answer

According to the documentation, the make_file method in versions of Python before 3.5 used a fixed charset of ISO-8859-1, which cannot represent Arabic.

Further, because the generated page declares ISO-8859-1 in its header, the browser decodes the UTF-8 bytes of the file with the wrong charset, which produces the gibberish. Thus, you have to either use that method in Python 3.5 to get utf-8, or generate the HTML output in a different way.
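A minimal sketch of both routes, assuming short sample lists in place of the question's original and mockinputs. On Python 3.5+ it passes charset explicitly. On older versions it swaps the declared charset in the generated page, which is a workaround that relies on the header text difflib emits there, not a documented API.

```python
import difflib
import sys

# Hypothetical sample inputs standing in for the question's lists.
original = ["بسم الله الرحمن الرحيم"]
mockinputs = ["بسم الله الرحمن"]

hdiff = difflib.HtmlDiff()

if sys.version_info >= (3, 5):
    # Python 3.5+: charset is a keyword-only argument (default 'utf-8').
    html = hdiff.make_file(original, mockinputs, charset="utf-8")
else:
    # Older versions declare ISO-8859-1 in the page header, so patch the
    # declaration to match the encoding used when writing the file.
    html = hdiff.make_file(original, mockinputs)
    html = html.replace("charset=ISO-8859-1", "charset=utf-8")

# The file encoding must agree with the charset declared in the page.
with open("out_file.html", "w", encoding="utf-8") as out_file:
    out_file.write(html)
```

Either way, the point is that the charset declared inside the page and the encoding passed to open() agree.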

Edit: as of Python 3.5.1, although the make_file method uses a default charset of utf-8, its sibling method make_table does not: it takes no charset argument and returns only the table markup, so take care when using the latter!
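If make_table is used, one option is to wrap its output in a page that declares the charset itself. The following is only a sketch: PAGE_TEMPLATE and the sample inputs are hypothetical and not part of difflib.

```python
import difflib

# Hypothetical sample inputs standing in for the question's lists.
original = ["السلام عليكم"]
mockinputs = ["السلام عليكم ورحمة الله"]

# make_table returns only the <table> markup, with no charset declaration.
table = difflib.HtmlDiff().make_table(original, mockinputs)

# Hypothetical wrapper page that declares UTF-8 explicitly.
PAGE_TEMPLATE = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Diff</title>
</head>
<body>
{table}
</body>
</html>
"""

page = PAGE_TEMPLATE.format(table=table)

# Write with the same encoding that the page declares.
with open("out_table.html", "w", encoding="utf-8") as out_file:
    out_file.write(page)
```

Note that this wrapper leaves out the stylesheet that make_file normally embeds, so changed spans will not be highlighted unless you add your own CSS.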

erewok