I've perused the other topics related to this question, but none have directly answered the question. I'm hoping that perhaps you all can help.

I'm working on cleaning up a WordPress theme which has been long abused and uncleaned. We have about 10-12 CSS files that aren't being used. Just before I was going to delete them, I was told that some of the files may have been referenced in the actual content of the site. *Shudder.* I'm using Python to search each line of the database dump for the name of each file. If it finds the name, it prints the number of the line where it was found and the line in its entirety. Lastly, it displays the totals and closes the files. Here is the code. (Heads up... I'm not the most comfortable with Python.)

# Load the CSS file names once, dropping newlines and blank lines.
with open("css.txt", "r") as cssfile:
    filenames = [name.strip() for name in cssfile if name.strip()]

for filename in filenames:
    totalfound = 0
    # Reopen the dump for each name; a file object can only be iterated once.
    with open("berea.sql", "r", encoding="utf-8") as s:
        for lineinfile, line in enumerate(s, start=1):
            if filename in line:
                print(lineinfile, line)
                totalfound = totalfound + 1
    if totalfound == 0:
        print("No results were found for %s" % filename)
    else:
        print("We found %i of %s in the database" % (totalfound, filename))

Honestly, the biggest problem comes from the encoding error I receive.

UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 445: character maps to <undefined>

I've seen that adding different decodes, encodes, etc. should fix it, but nothing seems to work... I would appreciate any and all assistance. I have about 349,000 lines to search through, and I keep getting stopped at 830.

bastelflp
Mark Ross

2 Answers

https://wiki.python.org/moin/PrintFails details this error.

"UnicodeEncodeError: 'charmap' codec can't encode character u'\u1234' in position 0: character maps to undefined"

This means that the python console app can't write the given character to the console's encoding.

More specifically, the python console app created a _io.TextIOWrapper instance with an encoding that cannot represent the given character.

...

By default, the console in Microsoft Windows only displays 256 characters (cp437, or "Code page 437", the original IBM-PC 1981 extended ASCII character set.)

If you try to print an unprintable character you will get UnicodeEncodeError.

Setting the PYTHONIOENCODING environment variable as described above can be used to suppress the error messages. Setting to "utf-8" is not recommended as this produces an inaccurate, garbled representation of the output to the console. For best results, use your console's correct default codepage and a suitable error handler other than "strict".
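To make the "error handler other than strict" part concrete, here is a minimal sketch. It does not touch the real console; it wraps an in-memory buffer in a TextIOWrapper the same way Python wraps stdout. The helper name render_for_console and the sample text are made up for illustration, and cp437 is only assumed to be the console's code page.

```python
import io


def render_for_console(text, encoding="cp437", errors="backslashreplace"):
    """Return the bytes a text stream with this encoding and handler would emit.

    Hypothetical helper: it stands in for the console with an in-memory buffer.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline="\n")
    stream.write(text)   # with errors="strict" this raises UnicodeEncodeError
    stream.flush()
    data = raw.getvalue()
    stream.detach()      # leave the buffer alone when the wrapper is discarded
    return data


if __name__ == "__main__":
    sample = "Berea \u2013 Home"  # contains an en dash, which cp437 lacks
    print(render_for_console(sample))                    # escaped as \u2013
    print(render_for_console(sample, errors="replace"))  # replaced with ?
    try:
        render_for_console(sample, errors="strict")
    except UnicodeEncodeError as exc:
        print("strict handler failed:", exc)
```

For the real console, the same idea should be reachable without code changes by giving PYTHONIOENCODING an "encoding:errorhandler" value (for example cp437:backslashreplace, assuming cp437 is your code page), or from Python 3.7 onwards by calling sys.stdout.reconfigure(errors="backslashreplace") at the top of the script.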

Try ignoring some of this advice and do the following in Windows CMD:

set PYTHONIOENCODING=utf-8
chcp 65001

Also set your console font to: Lucida Console

This should set the console to a crappy UTF-8 emulation and force Python to encode to UTF-8.

You may find it simpler to write the results to a UTF-8 encoded file instead of writing to a console.
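A rough sketch of the file-based route, in case it helps. The function name write_matches, the output file name results.txt and the search term are placeholders I made up; berea.sql is the dump from the question. Because the output file is opened with encoding="utf-8", characters such as '\u2013' never pass through the console's code page.

```python
def write_matches(lines, needle, out_path):
    """Write every line containing `needle` to a UTF-8 file and return the count.

    Hypothetical helper: `lines` can be an open file or any iterable of strings.
    """
    found = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for number, line in enumerate(lines, start=1):
            if needle in line:
                # Prefix each hit with its line number, as the console version did.
                out.write(f"{number}: {line.rstrip()}\n")
                found += 1
    return found


if __name__ == "__main__":
    # Placeholder names: adjust the search term and paths to your own files.
    with open("berea.sql", "r", encoding="utf-8") as dump:
        total = write_matches(dump, "old-style.css", "results.txt")
    # Only an ASCII summary goes to the console, so printing it is safe.
    print("Matches written:", total)
```

Open the results file afterwards in any editor that understands UTF-8.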

Use https://github.com/Drekin/win-unicode-console

Alastair McCormack
  • 2
    Python 3 uses Unicode strings, so it's relatively simple to use the Windows wide-character (UTF-16) API. Just use [win-unicode-console](https://github.com/Drekin/win-unicode-console). Avoid using codepage 65001 (UTF-8). The console system wasn't designed for a variable encoding that uses up to 4 bytes per code (the design is from the early 90s when Unicode was UCS2 and UTF-8 didn't exist) and still isn't in Windows 10. – Eryk Sun Jul 28 '15 at 21:40
  • Thanks for the tip @eryksun - I didn't know about that. Shouldn't that be answer for this question? I guess it works with Python 3 Strings and Python 2 Unicode strings. – Alastair McCormack Jul 29 '15 at 07:51
0

On Windows, just run it from the Python IDLE GUI instead of from the console window.