
When I use simple I/O calls to read a particular file on my system, such as:

f = open('file.ini')
for line in f.readlines():
    print line

I'm getting output such as this:

 H E L L O !  W H Y  A R E  T H E R E  S O  M A N Y  S P A C E S ?

I presume it's Unicode, but I can't quite figure out how to read it as Unicode or convert it to ASCII. Suggestions?
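For anyone reproducing the symptom, here is a minimal sketch (in Python 3, with an in-memory string standing in for the file contents) of how UTF-16 text read with a one-byte-per-character assumption produces those interleaved "spaces", which are actually NUL bytes:

```python
# Encode a string as UTF-16-LE: every ASCII character is followed by a 0 byte.
data = "HELLO".encode("utf-16-le")

# Naively decoding one byte per character (e.g. latin-1) keeps those 0 bytes,
# which many terminals render as blank space between the letters.
naive = data.decode("latin-1")
print(repr(naive))  # 'H\x00E\x00L\x00L\x00O\x00'

# Decoding with the correct codec recovers the text.
print(data.decode("utf-16-le"))  # HELLO
```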

Brian D
  • While we're at it, there is no such thing as Unicode. Well, there is, but it does not concern itself with earthly matters like files. There are numerous encodings which bridge that gap, but you don't appear to be aware of that or the difference this makes. See also: [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html) –  Oct 09 '12 at 23:22
  • I understand that it's simply a matter of defining how the bits are to be interpreted -- I really just want to know if there's a simple way I don't know of to make these bits look nice without having to start writing some low-level byte-truncating goop. – Brian D Oct 09 '12 at 23:25
  • 1
    Oh, you misunderstand me. Your code does not (yet) have to worry about stuff like that, and even when you start caring about these tings, you (praise python-dev!) don't have to work at byte level. Your problem is simpler, but I wanted to bring up an unrelated issue your language hints at. –  Oct 09 '12 at 23:28

2 Answers


Try opening the file using codecs to make things easier.

Example:

import codecs
f = codecs.open('file.ini', encoding='utf-16-le')  # You can experiment with different encodings
for line in f:  # note, the readlines is not really needed
    print line,  # the comma strips the trailing newline in case that's bothering you

PS: if you don't know the encoding, I recommend looking at this question: Determine the encoding of text in Python
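As a side note, in Python 3 the built-in open() accepts an encoding argument directly, so codecs.open is no longer needed. A sketch, assuming the file really is UTF-16 (the file path and contents here are illustrative, written to a temp directory just so the example is self-contained):

```python
import os
import tempfile

# Write a small UTF-16 sample file so the example is runnable end to end.
# 'utf-16' (no endianness suffix) writes a BOM and consumes it on reading.
path = os.path.join(tempfile.gettempdir(), "file.ini")
with open(path, "w", encoding="utf-16") as f:
    f.write("HELLO!\n")

with open(path, encoding="utf-16") as f:
    for line in f:
        print(line, end="")  # end="" plays the role of Python 2's trailing comma
```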

Wolph

V e r y   r e g u l a r   s p a c e s are usually an indicator that your data is encoded in UTF-16 -- typically every second byte is a 0 byte. You can confirm this by printing out the actual binary data that you are reading:

f = open('file.ini')
for line in f:
    print map(ord, line)

If you see output like this:

[..., 72, 0, 69, 0, 76, 0, 76, 0, 79, ...]

Then that's almost certainly the case.
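In Python 3 the same diagnostic is even shorter, because iterating a bytes object yields integers directly, so no map(ord, ...) is needed. A sketch, using an in-memory stand-in for the file contents:

```python
# Stand-in for f.read() on a file opened in binary mode ('rb'):
data = "HELLO".encode("utf-16-le")

# Iterating bytes in Python 3 yields integers, so list() shows the byte values.
print(list(data))  # [72, 0, 69, 0, 76, 0, 76, 0, 79, 0]
```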

The trick, then, is to figure out whether it's the even bytes that are 0s, or the odd bytes. There are two UTF-16 encodings: Big-endian and little-endian, named for the significance of the byte that comes first. If your 0s come before the character that they are associated with, then the file is big-endian, and you can open it like this (Python 3.x):

f = open('file.ini', encoding='utf-16-be')

In Python 2.x, import the codecs module to do this:

import codecs
f = codecs.open('file.ini', encoding='utf-16-be')

If the 0s come after, then substitute 'utf-16-le'.

(You need to make sure that you decode the file as you're reading it, or read the entire contents into memory before decoding. You definitely do not want to split lines apart before you decode.)

If you're lucky, the file was written with a Byte Order Mark (the character U+FEFF) at the beginning: if the first two bytes are [254, 255], the encoding is big-endian, and if [255, 254], it is little-endian.
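That BOM check can be sketched with the constants the codecs module already provides (the helper function name here is made up for illustration):

```python
import codecs

def guess_utf16_endianness(first_two_bytes):
    """Return a codec name based on a UTF-16 byte order mark, or None."""
    if first_two_bytes == codecs.BOM_UTF16_BE:   # b'\xfe\xff' == [254, 255]
        return "utf-16-be"
    if first_two_bytes == codecs.BOM_UTF16_LE:   # b'\xff\xfe' == [255, 254]
        return "utf-16-le"
    return None  # no BOM; fall back to inspecting where the 0 bytes sit

print(guess_utf16_endianness(b"\xff\xfe"))  # utf-16-le
```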

If none of those apply, then you might not be looking at UTF-16 data, and you'll have to do some more research to figure out what encoding you're looking at.

Ian Clelland
  • Giving you the gold for a complete explanation. I've already solved my problem, but for posterity's sake, this explanation will help others greatly. – Brian D Oct 10 '12 at 01:58