â€™ instead of ' in Natural Reader after encoding with utf-8

Question

I have some text that I got from the web. After processing, it is written to a txt file with

text_file = open("input.txt", "w")
text_file.write(finaltext.encode('utf-8'))
text_file.close()

When I open the txt file, everything looks fine. But when I load it into Natural Reader to turn it into audio, I see â€™ instead of ' for some, but not all, of the apostrophes.

What should I do?

Possibly related or helpful: [“â€™” showing on page instead of “ ' ”](http://stackoverflow.com/questions/2477452/%C3%A2%E2%82%AC-showing-on-page-instead-of) – davedwards, Apr 26 '17 at 21:30
Yes, the initial a-circumflex is a sure sign that you have UTF-8 being displayed as if it were one of the ISO-8859-1 related encodings. Most likely some but not all of the single quotes are curly quotes rather than plain apostrophes. – Peter DeGlopper, Apr 26 '17 at 21:31
How does Natural Reader handle Unicode? It seems like it would need to allow accented characters. – Mark Ransom, Apr 26 '17 at 22:25

Nick T · Accepted Answer · 2017-04-26T21:54:17.807


If you're opening the file with a native text editor and it looks fine, the issue is likely with your other program which isn't correctly detecting the encoding and mojibaking it up. As mentioned in comments, it's almost assuredly a Unicode quote character that looks like an ' but isn't.

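# A string containing curly (typographic) single quotes, not ASCII apostrophes
my_string = ('The Knights who say '
    '\N{LEFT SINGLE QUOTATION MARK}'
    'Ni!'
    '\N{RIGHT SINGLE QUOTATION MARK}'
)


def print_repr_escaped(x):
    # Show non-ASCII characters as \uXXXX escapes so they are easy to spot
    print(repr(x.encode('unicode_escape').decode('ascii')))

print_repr_escaped(my_string)
# 'The Knights who say \\u2018Ni!\\u2019'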
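To see where the `â€™` comes from, here is a minimal sketch. It assumes the other program decodes the file as Windows-1252 (cp1252), which is a guess about Natural Reader rather than something confirmed in this thread. The three UTF-8 bytes of a right single quotation mark each turn into a separate character, while a plain ASCII apostrophe comes through unchanged, which would explain why only some of the quotes are affected.

```python
import unicodedata

# Assumption: the consuming program decodes the bytes as cp1252 (Windows "ANSI").
curly = '\N{RIGHT SINGLE QUOTATION MARK}'   # U+2019
utf8_bytes = curly.encode('utf-8')          # three bytes: E2 80 99
mojibake = utf8_bytes.decode('cp1252')      # each byte becomes one character

# Print the character names instead of the characters themselves,
# so the output does not depend on the console encoding.
for ch in mojibake:
    print(unicodedata.name(ch))

# A plain ASCII apostrophe is one identical byte in both encodings.
plain = "'".encode('utf-8').decode('cp1252')
print(plain == "'")
```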

If you can't control the encoding of the other program, you have 2 options:

1. Drop all non-ASCII characters like so:

stripped = my_string.encode('ascii', 'ignore').decode('ascii')
print_repr_escaped(stripped)
# 'The Knights who say Ni!'

2. Attempt to convert non-ASCII characters to their closest ASCII equivalents with something like the third-party Unidecode package:

import unidecode

converted = unidecode.unidecode(my_string)
print_repr_escaped(converted)
# "The Knights who say 'Ni!'"
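If only the quotation marks are causing trouble, a narrower variant of the second option is to replace just those characters with the standard library and leave everything else alone. This is a sketch, not part of the original answer: `QUOTE_MAP` and `ascii_quotes` are hypothetical names, and the mapping only covers the four most common curly quotes.

```python
# Hypothetical mapping for illustration; extend it to whatever
# punctuation your text actually contains.
QUOTE_MAP = str.maketrans({
    '\u2018': "'",  # LEFT SINGLE QUOTATION MARK
    '\u2019': "'",  # RIGHT SINGLE QUOTATION MARK
    '\u201c': '"',  # LEFT DOUBLE QUOTATION MARK
    '\u201d': '"',  # RIGHT DOUBLE QUOTATION MARK
})


def ascii_quotes(text):
    """Replace curly quotes with ASCII ones; leave other characters as-is."""
    return text.translate(QUOTE_MAP)


sample = 'The Knights who say \u2018Ni!\u2019 in a caf\u00e9'
# ascii() escapes any remaining non-ASCII characters for safe printing.
print(ascii(ascii_quotes(sample)))
```

Unlike dropping every non-ASCII character, this keeps accented letters such as the `é` in the sample, although they may still be garbled if the reading program guesses the wrong encoding.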

edited Apr 26 '17 at 21:54

answered Apr 26 '17 at 21:39


Option 2 worked. Maybe I implemented option 1 wrong, but it stripped the `'` out of the text. – jason Apr 26 '17 at 23:12
That's what I meant by "drop the characters". The variable is also called `stripped` ;) – Nick T Apr 26 '17 at 23:44
No objections if this got you past your immediate problem, but it's not an ideal overall solution. It's just not true that all of Unicode can be collapsed into ASCII. It'd probably be worthwhile to spend some time figuring out how to tell Natural Reader what encoding your files are using. – Peter DeGlopper Apr 27 '17 at 02:55

score 1 · Answer 2 · answered Apr 27 '17 at 03:42

If you are on Windows, many Windows applications assume the native ANSI encoding for files unless there is a byte order mark (BOM) at the beginning of the file. A BOM is not normally necessary for UTF-8, but it serves as a signature for a UTF-8 file on Windows. You can write one with the utf-8-sig codec. The following works on both Python 2.x and 3.x:

import io
with io.open("input.txt", "w", encoding='utf-8-sig') as text_file:
    text_file.write(finaltext)
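To check that the BOM really ends up in the file, you can read the raw bytes back. This is a small sketch: the `finaltext` value here is a stand-in for the question's variable, and it writes to a temporary directory instead of a real input.txt.

```python
import codecs
import io
import os
import tempfile

# Stand-in for the question's finaltext.
finaltext = u'The Knights who say \u2018Ni!\u2019'

# Write to a temporary path so the sketch does not touch a real input.txt.
path = os.path.join(tempfile.mkdtemp(), 'input.txt')
with io.open(path, 'w', encoding='utf-8-sig') as text_file:
    text_file.write(finaltext)

# Read the raw bytes to inspect what was actually written.
with open(path, 'rb') as raw:
    data = raw.read()

# The file should start with the UTF-8 BOM, bytes EF BB BF.
has_bom = data.startswith(codecs.BOM_UTF8)
print(has_bom)

# Reading back with utf-8-sig strips the BOM again.
with io.open(path, encoding='utf-8-sig') as text_file:
    round_trip = text_file.read()
print(round_trip == finaltext)
```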
