I have the following function:
    def storeTaggedCorpus(corpus, filename):
        corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8')
        for token in corpus:
            tagged_token = '/'.join(str for str in token)
            tagged_token = tagged_token.decode('ISO-8859-1')
            tagged_token = tagged_token.encode('utf-8')
            corpusFile.write(tagged_token)
            corpusFile.write(u"\n")
        corpusFile.close()
And when I execute it, I get the following error:
    (...) in storeTaggedCorpus
        corpusFile.write(tagged_token)
      File "c:\Python26\lib\codecs.py", line 691, in write
        return self.writer.write(data)
      File "c:\Python26\lib\codecs.py", line 351, in write
        data, consumed = self.encode(object, self.errors)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
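For context, here is a separate check I did (my own sanity test, not from the traceback itself): 0xc3 is the lead byte of a two-byte UTF-8 sequence, e.g. u'\u00e3' ('a' with tilde, common in pt-br) encodes to the bytes 0xc3 0xa3, and the default ascii codec refuses exactly that byte. As far as I can tell, this is what codecs.py runs into when write() receives a byte string instead of a unicode object:

```python
# -*- coding: utf-8 -*-
# Sanity check: 0xc3 is the first byte of the UTF-8 encoding of accented
# characters such as u'\u00e3' ('a' with tilde).
utf8_bytes = u'\u00e3'.encode('utf-8')
print(repr(utf8_bytes))  # the two bytes 0xc3 0xa3

try:
    # Decoding those bytes with the default ascii codec fails on the
    # very first byte, just like in the traceback above.
    utf8_bytes.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc)
```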
So I went to debug it and discovered that the created file was encoded as ANSI, not UTF-8 as declared in corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8'). If the corpusFile.write(tagged_token) line is removed, the function will (obviously) run, and the file will be encoded as ANSI. If instead I remove tagged_token = tagged_token.encode('utf-8'), it also runs, BUT the resulting file is reported as "ANSI as UTF-8" (???) and the Latin characters are mangled. Since I'm analyzing pt-br text, this is unacceptable.
I believe everything would work fine if corpusFile were actually opened as UTF-8, but I can't get it to work. I've searched the Web, but everything I found about Python/Unicode dealt with something else. So why does this file always end up as ANSI? I am using Python 2.6 on Windows 7 x64, and the file encodings were reported by Notepad++.
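As a separate sanity check (hypothetical path and token, my own test, not the real corpus), codecs.open does seem to produce a real UTF-8 file when it is handed unicode objects, which makes me suspect the problem is that tagged_token is a byte string rather than unicode:

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Hypothetical file and token, just to test codecs.open in isolation.
path = os.path.join(tempfile.mkdtemp(), 'corpus_test.txt')

corpusFile = codecs.open(path, mode='w', encoding='utf-8')
corpusFile.write(u'c\u00e3o/N\n')  # a unicode object, not a byte string
corpusFile.close()

# Reading the raw bytes back shows genuine UTF-8 on disk:
# u'\u00e3' became the two bytes 0xc3 0xa3.
raw = open(path, 'rb').read()
print(repr(raw))
```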
Edit: About the corpus parameter
I don't know the encoding of the corpus string. It was generated by the PlaintextCorpusReader.tag() method from NLTK. The original corpus file was encoded in UTF-8, according to Notepad++. The tagged_token.decode('ISO-8859-1') call is just a guess. I've tried decoding it as cp1252 and got the same mangled characters as with ISO-8859-1.
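To show what I mean by "the same mangled characters" (my own minimal sketch, not real corpus data): decoding UTF-8 bytes with a single-byte codec turns each accented character into two junk characters, and the ISO-8859-1 and cp1252 guesses produce identical junk for these bytes, which is why I can't tell them apart:

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes for u'\u00e3' ('a' with tilde), as found in pt-br text.
utf8_bytes = u'\u00e3'.encode('utf-8')

# Decoding those bytes with a single-byte codec yields mojibake:
latin1_guess = utf8_bytes.decode('ISO-8859-1')
cp1252_guess = utf8_bytes.decode('cp1252')

print(repr(latin1_guess))            # two characters instead of one
print(latin1_guess == cp1252_guess)  # both guesses mangle identically
```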