How to read text copied from the web into a txt file using Python

Question

I'm learning how to read text files. I used this way:

f=open("sample.txt")

print(f.read())

It worked fine if I typed the txt file myself. But when I copied text from a news article on the web, it produced the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 738: character maps to <undefined>

I tried changing the Encoding setting in Notepad++ to UTF-8, as I had read somewhere that the encoding was the cause.

I also tried using:

f=open("sample.txt",encoding='utf-8')

from here

But it still didn't work.

The problem is with your terminal's encoding; it is configured to accept only ASCII, so that's what Python tries to provide, but it can't figure out how to encode `\u2014` in ASCII. — chepner, Mar 26 '16 at 14:04
I tried using 'sys.getdefaultencoding()' and it showed 'utf-8'. But when I used 'reload(sys)' and 'sys.setdefaultencoding' to change it, it showed error 'NameError: name 'reload' is not defined' when I typed 'reload(sys)'. Does this mean by default my terminal accepts only utf-8? If so, what was the problem when I tried to change it? — SK90, Mar 26 '16 at 14:37
@chepner, the Windows console, using the `charmap` codec is not limited to ASCII - it's limited to 8bit character sets/code pages — Alastair McCormack, Mar 26 '16 at 21:44
@SK90, `reload(sys);sys.setdefaultencoding()` was a terrible hack on Python 2, used by people who didn't understand how encoding works. It was intentionally made difficult to set. On Python 3, it's been made impossible. — Alastair McCormack, Mar 26 '16 at 21:51
related: [Python, Unicode, and the Windows console](http://stackoverflow.com/q/5419/4279) — jfs, Apr 29 '16 at 15:21

Alastair McCormack · Accepted Answer · 2016-03-26T22:10:03.433

1

You're on Windows and trying to print to the console. The print() is throwing the exception.

The Windows console natively supports only 8-bit code pages, so any character outside your active code page will fail to encode (despite what people say about chcp 65001).
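To reproduce the failure without a Windows console, here is a minimal sketch. It assumes a console limited to code page 850 and simulates it with an in-memory stream; the sample text and the name `fake_console` are made up for illustration.

```python
import io

# Sample text containing an em dash (U+2014), like the copied news article.
text = "Budget talks stalled \u2014 again"

# Hypothetical stand-in for a console that can only encode cp850.
fake_console = io.TextIOWrapper(io.BytesIO(), encoding="cp850")

try:
    print(text, file=fake_console)
    fake_console.flush()
    failed = False
except UnicodeEncodeError as exc:
    # Reading the file was fine; it is the encoding step in print() that fails.
    failed = True
    print("print() failed:", exc.reason)
```

The file read succeeds in both cases; only the step that encodes the text for the output stream raises.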

You need to install and use https://github.com/Drekin/win-unicode-console. This module talks to the console API at a low level, giving support for multi-byte characters for both input and output.

Alternatively, don't print to the console and write your output to a file, opened with an encoding. For example:

# body holds the text to write, e.g. the contents read from sample.txt
with open("myoutput.log", "w", encoding="utf-8") as my_log:
    my_log.write(body)

Ensure you open the file with the correct encoding.
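As a rough, self-contained illustration of the file approach, the sketch below writes text containing an em dash to a temporary file as UTF-8 and reads it back. The temporary directory and the sample text are assumptions for the demo, not part of the original answer.

```python
import os
import tempfile

body = "Budget talks stalled \u2014 again"

# Write into a throwaway directory so the demo does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "myoutput.log")

with open(path, "w", encoding="utf-8") as my_log:
    my_log.write(body)

# Read it back with the same encoding; the text should survive unchanged.
with open(path, encoding="utf-8") as my_log:
    round_trip = my_log.read()

# Inspect the raw bytes: the em dash is stored as a three-byte UTF-8 sequence.
with open(path, "rb") as raw:
    raw_bytes = raw.read()
```

Passing the same `encoding` on both the write and the read is what keeps the round trip lossless.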

edited Mar 26 '16 at 22:10

answered Mar 26 '16 at 21:57


The win-unicode-console worked! (even though I noticed one character came out differently). The alternative method works too but my goal was to achieve what the win-unicode-console did. Thanks! – SK90 Mar 27 '16 at 05:25
1- Windows console does support all Unicode characters and it can even display (if you configure an appropriate font) any (BMP) Unicode character. 2- the file may use any encoding that can represent characters in `body`. On Windows, `utf-16` could be preferable because `utf-8` might be misinterpreted by some tools—though it is a matter of preference. – jfs Apr 29 '16 at 15:21

Serge Ballesta · Answer 2 · 2016-03-27T09:45:37.943

0

I assume that you are using Python 3 from the open and print syntax you use.

The offending character u"\u2014" is an em dash (—). Since I assume you are using Windows, setting the console to UTF-8 (chcp 65001) might help, provided you are not using too old a version.

If this is a batch script and the print is only there to produce trace output, you could encode explicitly with errors='replace'. For example, assuming that your console uses code page 850:

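# encode with replacement, then decode so print() shows text rather than a bytes repr
print(f.read().encode('cp850', 'replace').decode('cp850'))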

This will replace all unmapped characters with ? - not very nice, but at least it does not raise...
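A small sketch of what the replacement does, again assuming code page 850 and a made-up sample string. It also shows the standard `backslashreplace` error handler as one alternative that keeps the code point visible instead of collapsing it to a question mark.

```python
text = "Budget talks stalled \u2014 again"

# 'replace' substitutes '?' for every character cp850 cannot represent.
safe = text.encode("cp850", errors="replace").decode("cp850")

# 'backslashreplace' writes the escape sequence instead, which is easier to debug.
escaped = text.encode("cp850", errors="backslashreplace").decode("cp850")

print(safe)
print(escaped)
```

Both results contain only characters the code page can represent, so printing them should not raise on a cp850 console.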

edited Mar 27 '16 at 09:45

answered Mar 26 '16 at 14:27


How do you set the console in UTF8? I'm beginner in this so I didn't understand what you meant by traces and the following code. – SK90 Mar 26 '16 at 15:04
@SK90 For Windows, UTF8 is the code page 65001. To set the console in UTF8 (as much as it can...) you just type `chcp 65001` at cmd prompt. – Serge Ballesta Mar 26 '16 at 16:54
@SergeBallesta, while the default encoding in Python 3 is UTF-8 (when creating Unicode `str` from bytes), it is *not* the default encoding used for opening files. The locale from the user's environment is used for the default encoding, as returned by `locale.getpreferredencoding()`. – Alastair McCormack Mar 26 '16 at 21:37
I mean the `chcp 65001` method – SK90 Mar 27 '16 at 05:20
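To see the distinction drawn in the comments above on your own machine, a short check along these lines may help. It only reports values and changes nothing; the variable names are illustrative, and the results depend on your system and Python version.

```python
import locale
import sys

# Default used when converting between str and bytes without naming a codec.
str_default = sys.getdefaultencoding()

# Locale-derived encoding, which open() typically falls back to when no
# encoding argument is given.
open_default = locale.getpreferredencoding(False)

# Encoding that print() uses for the current standard output stream.
console_encoding = sys.stdout.encoding

print("str/bytes default:", str_default)
print("open() default:   ", open_default)
print("stdout encoding:  ", console_encoding)
```

If the last value is an 8-bit code page such as cp850 or cp437, printing an em dash will raise the error from the question.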
