Python Encoding NLTK - 'charmap' codec can't encode character

Question

   import pypyodbc
   from pypyodbc import *
   import nltk 
   from nltk import *
   import csv
   import sys
   import codecs
   import re

   #connect to the database 
   conn = pypyodbc.connect('Driver={Microsoft Access Driver (*.Mdb)};\
          DBQ=C:\\TextData.mdb')

   #create a cursor to control the database with
   cur = conn.cursor()

   cur.execute('''SELECT Text FROM MessageCreationDate WHERE Tags LIKE 'GHS - %'; ''')
   TextSet = cur.fetchall()
   ghsWordList = []
   TextWords = list(TextSet)

   for row in TextWords :
       message = re.split(r'\W+',str(row))
       for eachword in message :
           if eachword.isalpha() :
               ghsWordList.append(eachword.lower())

   print(ghsWordList)

When I run this code, it's giving me an error:

'charmap' codec can't encode character '\u0161' in position 2742: character maps to <undefined>

I've looked at a number of other answers on here to similar questions, and googled the hell out of it; however, I am not well versed enough in Python or character encoding to know where I need to use the codecs module to change the character set being used to present/append/create the list.

Could someone not only help me with the code but also point me in the direction of some good reading materials for understanding this sort of thing?

Possible duplicate of [UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function](http://stackoverflow.com/questions/14630288/unicodeencodeerror-charmap-codec-cant-encode-character-maps-to-undefined) – roeland, Dec 23 '15 at 03:18
Windows 10, unfortunately. I would like to use a Linux distro, but I haven't used Linux for a number of years, I need this work done relatively quickly, and I can't compound my problem by learning a new OS. The line that throws the exception is the very last one, print(ghsWordList). It seems to be a problem with displaying the characters rather than with how the code handles them (I am able to export them to an Access database and to xls, and they display correctly there). – Samuel Jackson, Dec 28 '15 at 14:43
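
The diagnosis in the last comment, that print() fails because the console's code page cannot represent the character, can be reproduced without a database. The following is a minimal sketch, assuming Python 3 and assuming the console uses code page 437 (the question does not state the actual code page). The helpers `to_console_safe` and `safe_print` are hypothetical names for illustration, not part of any library.

```python
import sys

word = "\u0161"  # the character named in the error message

# Encoding to a legacy code page that lacks the character raises the
# same kind of error. cp437 is an assumption about the console here.
try:
    word.encode("cp437")
    error_text = ""
except UnicodeEncodeError as exc:
    error_text = str(exc)
    print("reproduced:", error_text)


def to_console_safe(text, encoding=None):
    """Replace characters the target encoding lacks with backslash escapes."""
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


def safe_print(obj):
    """Print an object without raising UnicodeEncodeError on the console."""
    print(to_console_safe(str(obj)))


# Stand-in for ghsWordList; the real list comes from the database query.
safe_print(["\u0161koda", "text"])
```

The list itself is left untouched; only the text sent to the console is escaped. On Python 3.7 or later, calling `sys.stdout.reconfigure(errors="backslashreplace")` once at startup, or setting the `PYTHONIOENCODING` environment variable before launching Python, may achieve a similar effect without a helper.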

Luis Miguel · Answer 1 · 2015-12-23T13:30:18.683

0

If you are using Python 2.x, add the following lines to your code:

   import sys
   reload(sys)
   sys.setdefaultencoding('utf-8')

Note: if you are using Python 3.x, reload is not a built-in; it is imp.reload(), so an import needs to be added for my solution to work. I don't develop in 3.x, so my suggestion is:

   from imp import reload
   import sys
   reload(sys)
   sys.setdefaultencoding('utf-8')

Place this ahead of all your other imports.
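
For readers on Python 3, a quick check may help when trying the suggestion above. This is a small sketch, assuming a standard CPython 3 interpreter: there, the default encoding is already UTF-8 and `sys.setdefaultencoding` does not exist, so what matters for print() is the encoding of the output stream instead.

```python
import sys

# On Python 3 the default encoding for str/bytes conversions is fixed.
print(sys.getdefaultencoding())

# The function called in the snippet is not available on Python 3.
print(hasattr(sys, "setdefaultencoding"))

# print() encodes with the output stream's encoding, which on a Windows
# console may be a legacy code page rather than UTF-8.
print(sys.stdout.encoding)
```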

edited Dec 23 '15 at 13:30

answered Dec 23 '15 at 03:22

Luis Miguel


While this may avoid the UnicodeEncodeError, it will probably print mojibake on the console if any non-ASCII characters are printed. – roeland Dec 23 '15 at 03:30
Thank you for answering! I tried this with no success; I get "name 'reload' is not defined" as an error. – Samuel Jackson Dec 23 '15 at 06:29
Samuel Jackson, check out the expansion of my answer above. – Luis Miguel Dec 23 '15 at 13:30
Thank you for all the help. After a little research into what you've been advising, it seems that once you get past 3.4.1 it changes again to import importlib, which still doesn't seem to work. – Samuel Jackson Dec 23 '15 at 14:05
No, no, no! Setting `sys.setdefaultencoding()` is a hack for people who don't understand how Python's encoding works. Please don't use it! http://stackoverflow.com/questions/3828723/why-we-need-sys-setdefaultencodingutf-8-in-a-py-script – Alastair McCormack Dec 24 '15 at 22:37
The very last line, print(ghsWordList), throws the exception. I think the problem is that the interpreter can't display the characters, rather than the code mishandling them. I've decided not to use the setdefaultencoding method, since it seems to conflict with the methodology set out by the Python framework, and many third-party modules would not work with it because of the ASCII problem. For the moment, I've chosen a workaround that exports the characters to somewhere that can understand them. It's still not solved, though. – Samuel Jackson Dec 28 '15 at 14:38
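
Two points from the comments above, the mojibake risk and the export workaround, can be illustrated together. This is a rough sketch assuming Python 3. The word list, the file name `ghs_words.txt`, and the choice of cp437 as the mismatched code page are all assumptions made for the example.

```python
import os
import tempfile

# Stand-in for ghsWordList; the real list comes from the database query.
words = ["\u0161koda", "caf\u00e9", "text"]

# Mojibake: UTF-8 bytes read by something that assumes another code page
# (cp437 here) decode without error but come out as different characters.
garbled = words[0].encode("utf-8").decode("cp437")
print(ascii(words[0]), "->", ascii(garbled))

# Workaround: write to a file with an explicit encoding instead of
# relying on the console's code page.
path = os.path.join(tempfile.mkdtemp(), "ghs_words.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write("\n".join(words))

# Reading back with the same encoding returns the original words.
with open(path, encoding="utf-8") as handle:
    round_trip = handle.read().split("\n")
```

Passing `encoding="utf-8"` to both open() calls keeps the round trip independent of the platform's default encoding.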
