1

I have sifted through lots and lots of Python/Unicode explanations but I just can't seem to make sense of this.

Here is the situation:

I am pulling loads of comments off reddit (making a bot) and would like to primarily store them in a MongoDB, but also need to be able to print out comment trees in order to manually check what's going on.

I have had no problems so far putting comments into the DB, but when I try to print to stdout the CP1252 charset is having trouble with characters that it obviously doesn't support.

As I understand it, in Python 3 all strings are stored internally as Unicode, and only input and output have to be bytes, so this is fine: I can encode the Unicode string to CP1252, and in a couple of situations I will see \x** characters, which I don't mind. I am guessing they represent out-of-range characters?

The problem is I was printing out comment trees (to stdout) using \n (linefeeds) and tabs so it was easy to look over, but apparently when you encode a unicode string with newline escape sequences it escapes them so they get printed as literals.

For reference here is my encode statement:

encoded = post.tree_to_string().encode('cp1252','ignore')

Thanks

EDIT:

What I want is

|Parent Comment

    |Child comment 1

        |GChild comment 1

    |Child comment 2

|Parent Comment 2

What I get is

b"\n|Parent comment \n\n |Child comment \n\n etc
Alex
  • Are you really `print`ing the strings, or are you just looking at the string at the Python prompt? – oefe Oct 06 '13 at 14:14
  • I want to be able to print them to a file/stdout so I can manually look over them - see example I am now putting in main post – Alex Oct 06 '13 at 15:52

3 Answers

2

When printing to the console, Python automatically encodes strings in the console's encoding (cp437 on US Windows) and raises an exception for any character that the console encoding does not support. For example:

#!python3
#coding: utf8
print('Some text\nwith Chinese 美国\ncp1252 ÀÁÂÃ\nand cp437 ░▒▓')

Output:

Traceback (most recent call last):
  File "C:\test.py", line 5, in <module>
    print('Some text\nwith Chinese \u7f8e\u56fd\ncp1252 \xc0\xc1\xc2\xc3\nand cp437 ░▒▓')
  File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 24-25: character maps to <undefined>

To change this default, you can alter stdout to explicitly specify the encoding and how to handle errors:

#!python3
#coding: utf8
import io,sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding=sys.stdout.encoding,errors='replace')
print('Some text\nwith Chinese 美国\ncp1252 ÀÁÂÃ\nand cp437 ░▒▓')

Output (to a cp437 console):

Some text
with Chinese ??
cp1252 ????
and cp437 ░▒▓
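To see what errors='replace' does without depending on your actual console, here is a minimal sketch that wraps an in-memory buffer instead of sys.stdout.buffer. The helper name render_for_console is made up for illustration, and cp437 is only assumed as the console encoding:

```python
import io

def render_for_console(text, encoding='cp437', errors='replace'):
    """Return text as it would look after passing through a TextIOWrapper
    with the given encoding and error handler (hypothetical helper)."""
    raw = io.BytesIO()
    # newline='\n' keeps line endings untouched so the result is easy to compare
    wrapper = io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline='\n')
    wrapper.write(text)
    wrapper.flush()
    data = raw.getvalue()
    wrapper.detach()  # leave the BytesIO open when the wrapper goes away
    return data.decode(encoding)

preview = render_for_console('Chinese 美国\ncp1252 ÀÁÂÃ\ncp437 ░▒▓')
```

The newlines stay real newlines; only the characters missing from cp437 turn into question marks. On Python 3.7 and later, sys.stdout.reconfigure(errors='replace') should give a similar result without rebuilding the wrapper (that method did not exist in the Python 3.3 used above).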

You can also do this explicitly without altering stdout, by writing directly to its buffer interface:

import sys
sys.stdout.buffer.write('Some text\nwith Chinese 美国\ncp1252 ÀÁÂÃ\nand cp437 ░▒▓'.encode('cp437',errors='replace'))

A third alternative is to set the following environment variable before starting Python, which will alter stdout similarly to the TextIOWrapper solution:

PYTHONIOENCODING=cp437:replace
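
The part after the colon is the name of an error handler, the same names that str.encode accepts. As a rough illustration (the sample string is my own, not from the question), this is how a few of the standard handlers treat characters that cp1252 cannot represent:

```python
sample = 'À 美国'  # À exists in cp1252, the two Chinese characters do not

# Encode the same text with several standard error handlers and compare.
results = {
    handler: sample.encode('cp1252', errors=handler)
    for handler in ('ignore', 'replace', 'backslashreplace', 'xmlcharrefreplace')
}
```

'ignore' silently drops the unsupported characters, which is why it can hide problems; 'replace' or 'backslashreplace' at least leave a visible marker.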

Finally, since you mentioned also writing to a file, the easiest way to see all the characters you are writing is to use UTF-8 as the encoding to a file:

#!python3
#coding: utf8
with open('out.txt','w',encoding='utf8') as f:
    f.write('Some text\nwith Chinese 美国\ncp1252 ÀÁÂÃ\nand cp437 ░▒▓')
Mark Tolonen
  • That last section of code actually got it to write to a file with the format I wanted so thanks. It is strange that I can't get it to write to stdout but I need to understand the subject better. In the meantime this will do for me. – Alex Oct 06 '13 at 17:08
0

I don't know if I understood your problem correctly, but couldn't you just remove newlines and tabs before printing to stdout?

import re
print(re.sub('[\t\n]', ' ', post.tree_to_string()))

You could also tell Python to remove all control chars, as stated here.

Lucio Paiva
  • I need the newlines for formatting - the tree_to_string gives a nice view of comments with spacing between them and indenting - so I want stdout to leave the newlines in place of '\n' but it does not parse them as linefeeds - I guess they are escaped? – Alex Oct 06 '13 at 15:45
0

You do NOT need to encode strings to bytes for printing in Python 3; just make your stdout (console) a Unicode environment...

print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)
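
To illustrate why the encoded value shows literal \n sequences (a minimal sketch; the tree string is a hypothetical stand-in for whatever post.tree_to_string() returns): print() on a str writes the text itself, while print() on a bytes object writes its repr, with the b prefix and escaped control characters.

```python
import io

# Hypothetical stand-in for post.tree_to_string()
tree = "\n|Parent comment\n\n    |Child comment 1\n"
encoded = tree.encode('cp1252', 'ignore')  # this is a bytes object, not a str

def printed(obj):
    """Capture exactly what print(obj) would write."""
    buf = io.StringIO()
    print(obj, file=buf)
    return buf.getvalue()

as_text = printed(tree)      # real line breaks and indentation
as_bytes = printed(encoded)  # one line starting with b'\n|Parent comment...
```

So the newlines are not lost by encoding; they only look escaped because a bytes object is being printed instead of a string.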

Alan Hall
  • I have heard this is bad practice? I will try this if I cannot find anything else though. – Alex Oct 06 '13 at 16:15