
I have the following function:

import codecs

def storeTaggedCorpus(corpus, filename):
    corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8')
    for token in corpus:
        tagged_token = '/'.join(str for str in token)
        tagged_token = tagged_token.decode('ISO-8859-1')
        tagged_token = tagged_token.encode('utf-8')
        corpusFile.write(tagged_token)
        corpusFile.write(u"\n")
    corpusFile.close()

And when I execute it, I get the following error:

(...) in storeTaggedCorpus
    corpusFile.write(tagged_token)
  File "c:\Python26\lib\codecs.py", line 691, in write
    return self.writer.write(data)
  File "c:\Python26\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

So I went to debug it, and discovered that the created file was encoded as ANSI, not UTF-8 as declared in `corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8')`. If the `corpusFile.write(tagged_token)` line is removed, this function will (obviously) work, and the file will be encoded as ANSI. If instead I remove `tagged_token = tagged_token.encode('utf-8')`, it will also work, BUT the resulting file will have the encoding "ANSI as UTF-8" (???) and the Latin characters will be mangled. Since I'm analyzing pt-br text, this is unacceptable.

I believe that everything would work fine if corpusFile were actually opened as UTF-8, but I can't get it to work. I've searched the Web, but everything I found about Python/Unicode dealt with something else. So why does this file always end up as ANSI? I am using Python 2.6 on Windows 7 x64, and those file encodings were reported by Notepad++.
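For reference, a minimal snippet that reproduces the same error (writing any non-ASCII byte string to a codecs writer triggers the implicit ASCII decode):

import codecs

f = codecs.open('test.txt', mode = 'w', encoding = 'utf-8')
f.write('\xc3\xa9')  # the UTF-8 bytes for 'é', passed as a str; raises UnicodeDecodeError
f.close()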

Edit: About the `corpus` parameter

I don't know the encoding of the corpus string. It was generated by the PlaintextCorpusReader.tag() method, from NLTK. The original corpus file was encoded in UTF-8, according to Notepad++. The `tagged_token.decode('ISO-8859-1')` is just a guess. I've tried decoding it as cp1252, and got the same mangled characters as with ISO-8859-1.
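A quick way to inspect what the corpus items actually are (a diagnostic sketch; it just prints the first few tokens and works for any iterable):

import itertools

for token in itertools.islice(corpus, 3):
    print type(token), repr(token)  # repr shows u'...' for unicode and '...' for byte strings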

Metalcoder
  • What is an "ANSI encoded" file? – sarnold Nov 09 '11 at 00:10
  • I don't have a Windows installation at hand (and the issue almost certainly stems from Windows' strange file handling), but you should either open the file with mode `'w', encoding='utf8'` and write `unicode` objects (the results of `decode`) **or** open the file with mode `'wb'`(no encoding) and write `str` objects (the result of `encode`). – phihag Nov 09 '11 at 00:13
  • To start with, is `token` really a str encoded in `ISO-8859-1`? – Petr Viktorin Nov 09 '11 at 00:14
  • @sarnold check [this](http://stackoverflow.com/questions/701882/what-is-ansi-format). – Metalcoder Nov 09 '11 at 00:15
  • @PetrViktorin It seems so. The original code didn't have this line, and it crashed in the next line with a `UnicodeDecodeError`. After including this line, it stopped complaining. – Metalcoder Nov 09 '11 at 00:17
  • @phihag **(1)** I took off the `encode('utf-8')` line and switched `utf-8` to `utf8` when opening the file, with no luck (again, "ANSI as UTF-8"). **(2)** This time, I kept the `encode('utf-8')` line and changed the mode to 'wb', and it threw a `UnicodeDecodeError` in `corpusFile.write(tagged_token)`. – Metalcoder Nov 09 '11 at 00:20
  • @Metalcoder Found a Windows VM! Expanded both into an answer. Feel free to comment there. – phihag Nov 09 '11 at 00:26
  • @Metalcoder, aha, thanks; so probably [Windows Codepage 1252](http://en.wikipedia.org/wiki/Windows-1252) unless it isn't. Sheesh. Any chance you could run [`file(1)`](http://www.darwinsys.com/file/) on it? :) – sarnold Nov 09 '11 at 00:29
  • @MetalCoder: If you are on Windows, the probability that you have ISO-8859-1 is **ZERO**. As you say you are dealing with pt-br text, you are most likely to have cp1252. – John Machin Nov 09 '11 at 00:34
  • How do you know how the file was encoded? A text file containing only 7-bit characters is simultaneously ASCII *and* ISO-8859-1 (Latin-1) *and* "ANSI" (really Windows-1252 or CP-1252) *and* UTF-8. The formats differ only in how they represent characters with codes outside the range 0..127. – Keith Thompson Nov 09 '11 at 01:03
  • @sarnold: You might want to read [this question](http://stackoverflow.com/questions/701882/what-is-ansi-format) again; I've just edited the answer for greater accuracy. – Keith Thompson Nov 09 '11 at 01:04
  • @Keith, your edits are definitely an improvement -- but your comment here is best yet. :) – sarnold Nov 09 '11 at 01:12
  • It looks to me as if you are double encoding. But I think the first troubleshooting step is to try it with known data: write a single string with Unicode content of your choice. – Harry Johnston Nov 09 '11 at 02:05
  • If the source file was Unicode, perhaps corpus is already Unicode and you don't need the .decode *or* the .encode. – Harry Johnston Nov 10 '11 at 22:15

3 Answers


When you open the file with `codecs.open(filename, 'w', encoding='utf8')`, there is no point in writing byte strings (`str` objects) into it. Instead, write `unicode` objects, like this:

corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8')
# ...
tagged_token = '\xdcml\xe4ut'                     # a byte string: 'Ümläut' encoded in ISO-8859-1
tagged_token = tagged_token.decode('ISO-8859-1')  # now a unicode object
corpusFile.write(tagged_token)                    # the stream writer encodes it to UTF-8
corpusFile.write(u'\n')

Note that codecs.open always opens the file in binary mode, so the u'\n' is written as-is; no platform-dependent End-Of-Line translation takes place.

Alternatively, open a binary file and write already-encoded byte strings:

corpusFile = open(filename, mode = 'wb')
# ...
tagged_token = '\xdcml\xe4ut'                     # a byte string: 'Ümläut' encoded in ISO-8859-1
tagged_token = tagged_token.decode('ISO-8859-1')  # decode to unicode first
corpusFile.write(tagged_token.encode('utf-8'))    # then encode explicitly to UTF-8
corpusFile.write('\n')

This will also write platform-independent EOLs. If you want a platform-dependent EOL, write os.linesep instead of '\n' (note that os.sep is the path separator, not the line separator).
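For illustration, in an interactive session on Windows (on Unix-like systems os.linesep is just '\n'):

>>> import os
>>> os.linesep   # the platform's line terminator
'\r\n'
>>> os.sep       # the path separator; unrelated to line endings
'\\'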

Note that the encoding naming in Notepad++ is misleading: ANSI as UTF-8 is Notepad++'s label for UTF-8 without a BOM, which is what you want.

phihag
  • No luck :( See my answer to your comment, in the question. – Metalcoder Nov 09 '11 at 00:28
  • @Metalcoder Updated the answer with an explanation of why this code works ;) If you are certain the result is **not** UTF-8 (and if Notepad++ names it `ANSI as UTF-8`, it *is* UTF-8), can you post a hexdump of the file written by one of the two alternative [executable](http://sscce.org) programs in this answer? – phihag Nov 09 '11 at 00:31
  • Don't need a hex dump; Python: `print repr(open("thefile", "rb").read(200))` should do the trick. – John Machin Nov 09 '11 at 00:38
  • @Metalcoder John was listing ways to create the hexdump I mentioned (See the second comment under this answer). – phihag Nov 09 '11 at 00:48
  • @JohnMachin Re: s/ISO-8859-1/cp1252/ . Are you sure about that? His input might just be ISO-8859-1. – phihag Nov 09 '11 at 00:53
  • @phihag: The actually-used chars in cp1252 are a superset of the actually-used chars in ISO-8859-1. It is common for web pages and files to have characters that are in cp1252 but not in ISO-8859-1, but they or their owners declare the contents to be ISO-8859-1. Practical people, including the ones who wrote the draft HTML 5 standard, suggest that everyone quietly ignore the alleged encoding and use cp1252 instead (see the snippet after this comment thread). – John Machin Nov 09 '11 at 02:53
  • @JohnMachin: ISO-8859-1 should not be blindly considered to be a subset of Windows-1252: if one is using the control codes in ISO-8859-1, the two character sets are disjoint. There are two variants of ISO-8859-1, one without the control codes and one with, and since just about every encoder out there is using the one **with** the control codes, *including* Python, that one is the relevant one here. Also, frankly, "practical people [who] suggest that everyone quietly ignore the [character set that a document is advertised as being encoded in] and use cp1252 instead" are out of their minds. – Thanatos Nov 09 '11 at 08:57
  • @phihag: instead of printing the newline in two lines of code with `write()`, one can directly use the more powerful `print >> corpusFile, tagged_token`. This allows the program to use all the capabilities of the print statement (like `print 1, 3, "Hello"`, which would be more cumbersome if using `write()` instead). – Eric O Lebigot Nov 09 '11 at 09:24
  • @Thanatos: Yes, it is a deficiency that `ISO-8859-1` encoders that don't do the C1 control codes are not more widely available. In practice, if you find any data with Unicode codepoints in the range 0080-009F, the probability that they were originally 8-bit C1 control characters is vanishingly small, especially if the data was created in a Windows environment. – John Machin Nov 09 '11 at 10:03
  • @Thanatos: The world has more to fear from people who do things like given a collection of Windows-origin files encoded in a mixture of cp1252 and cp850, blindly transcode them from ISO-8859-1-with-C1-control-codes to UTF-8 and then concatenate them into one monstrous mess and delete the originals. – John Machin Nov 09 '11 at 10:05
  • @EOL I think that `write` is easier to read and spares questions about the obscure internals of a deprecated statement. Also, I'm not sure what EOL `print` uses on Windows. Even if that weren't the case, the focus here should be on bytes (`str`) vs strings (`unicode`), something the magic of the print statement would hide. – phihag Nov 09 '11 at 11:45
  • @phihag: I believe that the "print chevron" is actually not deprecated (see http://docs.python.org/reference/simple_stmts.html#the-print-statement). Furthermore, I also think that it is underused. Of course, Python 3 users can usefully use the new `print()` function, as do Python 2 users with `from __future__ import print_function`. `print` also has the advantage of printing the correct newline on Windows, for text files. Speaking of which, your second (binary) solution fails to write the correct newline sequence on Windows; it would be great to also check the first solution's newline. – Eric O Lebigot Nov 09 '11 at 13:28
  • @EOL While it isn't formally deprecated, `print >>` is not available on Python 3. Since the OP wants to write UTF-8 without a BOM, and that's a typical platform-independent format, I'd strongly assume he wants to write platform-independent files. I see no reason why anyone who does not interface with legacy components would want platform-dependent output. Updated the answer with a note about EOLs. – phihag Nov 09 '11 at 14:09
  • @phihag: Thanks for adding information about EOLs. I'm not sure I understand what your point about `print >>` is: in fact, a reason why Python 3's `print()` has a `file` option is precisely to give the same effect as `print >>`. Thus, the equivalent of the modern `print()` function is the Python 2 `print >>` statement (not `write()`, which only takes a single, string argument). `print >>` is very useful: doing the equivalent of `print >> corpusFile, 1, 2, x, "Hello"` with `write()` would be arguably much less convenient, simple and legible. – Eric O Lebigot Nov 09 '11 at 17:22
  • @EOL Is it your personal mission to include EOL info everywhere? ;) You're right; print would work as well, and translate well to modern Python. I think I personally associate it with strings (instead of byte arrays) and magic though (despite being fully documented) - and that's precisely what I wanted to avoid in this answer. Asked the other way round, is there any advantage of `print` over `write`, apart from personal preference? – phihag Nov 09 '11 at 17:27
  • @phihag: Thanks for the discussion. Even though both `print` and `write` have their use, `print` has the following two advantages: (1) you don't have to add the usual `\n` or `os.linesep`, which makes the code slightly faster to read; (2) you can sometimes dispense with string formatting, which again renders the code slightly faster to read: (Python 3) `print(x, '=', y, file=corpusFile)` is quite direct, compared to `corpusFile.write('{} = {}{}'.format(x, y, os.linesep))`. Thus, I really think that `print()` (Python 3) and `print >>` (Python 2) serve a real purpose. – Eric O Lebigot Nov 09 '11 at 21:49
  • @phihag: Personal note: the initials of my full name are EOL, hence my StackOverflow name. No relation with end-of-line. :) – Eric O Lebigot Nov 09 '11 at 21:51
  • @JohnMachin and @phihag: Sorry, I'd never come across a hexdump before, so I didn't understand the suggestion of running `print open("thefile", "rb").read().decode("utf8")` at first. I've tried it, and got a "UnicodeEncodeError: 'ascii' codec can't encode character...". It crashed on the first line: "ï/N". This line is weird... the following line is "»¿/N". Neither is in the corpus before tagging. Only the third line has the expected outcome: "criadouro/N". The two characters "ï" and "¿" don't occur in my corpus, and aren't used in pt-br. In fact, both are really exotic in this language. – Metalcoder Nov 10 '11 at 22:20
  • Ok folks...time to move this to a chat room. Comments are not intended for extended discussion. Thanks – Kev Nov 10 '11 at 22:44
  • @Metalcoder You may want to read up on [Unicode and character encodings](http://www.joelonsoftware.com/articles/Unicode.html). If you execute `print repr(open(filename, "rb").read(200))`, what output do you get with the first and second program in this answer? – phihag Nov 10 '11 at 23:08
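A short illustration of the cp1252-versus-ISO-8859-1 point raised in the comments above (a sketch; byte 0x80 is one of the bytes on which the two encodings differ):

>>> '\x80'.decode('cp1252')      # the euro sign in cp1252
u'\u20ac'
>>> '\x80'.decode('iso-8859-1')  # a C1 control code in ISO-8859-1
u'\x80'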

Try writing the file with a UTF-8 signature (aka BOM):

def storeTaggedCorpus(corpus, filename):
    corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8-sig')
    for token in corpus:
        tagged_token = '/'.join(str for str in token)
        # print(type(tagged_token)); break
        # tagged_token = tagged_token.decode('cp1252')
        corpusFile.write(tagged_token)
        corpusFile.write(u"\n")
    corpusFile.close()

Note that this will only work properly if `tagged_token` is a unicode string. To check that, uncomment the first comment in the above code - it should print `<type 'unicode'>`.

If tagged_token is not a unicode string, then you will need to decode it first using the second commented line. (NB: I've assumed a "cp1252" encoding, but if you're certain it's "iso-8859-1", then of course you will need to change it.)
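As an aside, the utf-8-sig signature is just three bytes at the start of the file; if those bytes are later misread under a single-byte encoding such as cp1252, they show up as the stray characters ï»¿ (a quick interactive check):

>>> import codecs
>>> codecs.BOM_UTF8                          # the UTF-8 signature bytes
'\xef\xbb\xbf'
>>> print codecs.BOM_UTF8.decode('cp1252')   # the same bytes misread as cp1252
ï»¿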

ekhumoro
  • Oh, man, it prints `<type 'str'>`!!! I've tried switching to `u'/'` in the `join` thing, and it threw a UnicodeDecodeError. I didn't expect this, and I'm going to run some tests. – Metalcoder Nov 10 '11 at 22:28
  • @Metalcoder. Switching to `u'/'` won't work, because the rest of the string won't be decoded properly. To do that, remove the print statement and uncomment the second comment as shown above. – ekhumoro Nov 10 '11 at 23:05

If you are seeing "mangled" characters from a file, you need to ensure that whatever you are using to view the file understands that the file is UTF-8-encoded.

The files created by this code:

import codecs
for enc in "utf-8 utf-8-sig".split():
    with codecs.open(enc + ".txt", mode = 'w', encoding = enc) as corpusFile:
        tagged_token = '\xdcml\xe4ut'
        tagged_token = tagged_token.decode('cp1252') # not 'ISO-8859-1'
        corpusFile.write(tagged_token) # write unicode objects
        corpusFile.write(u'\n')

are identified thusly:

Viewer: utf-8.txt, utf-8-sig.txt
Notepad++ (version 5.7 (UNICODE)): UTF-8 without BOM, UTF-8
Firefox (7.0.1): Western (ISO-8859-1), Unicode (UTF-8)
Notepad (Windows 7): UTF-8, UTF-8

Putting a BOM in your UTF-8 file, while deprecated on Unix systems, gives you a much better chance on Windows that other software will be able to recognise your file as UTF-8-encoded.
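For code that has to read such files back, the signature can be stripped by hand, or handled automatically by the utf-8-sig codec (a small sketch; the filename is illustrative):

import codecs

data = open('corpus.txt', 'rb').read()
if data.startswith(codecs.BOM_UTF8):
    data = data[len(codecs.BOM_UTF8):]  # drop the 3-byte signature
text = data.decode('utf-8')

# equivalently, the codec strips it for you:
# text = data.decode('utf-8-sig')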

John Machin
  • I've tried sending a BOM before posting this question, and got the same problems. But the BOM should only be an issue for finding out the encoding of the file; I believe it would have no effect on what gets stored in it. Am I wrong? – Metalcoder Nov 10 '11 at 22:28