When I run this code in a Jupyter notebook, it creates a list and gets rid of the UTF-8 BOM at the start of the file. But when I run it with Python 3.6 in Eclipse, it throws this error:

File "C:\Users\msjho\eclipse-workspace\MITX\src\root\nested\parsertext.py", line 10, in <module>
    print(stuff)
  File "C:\Users\msjho\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1897-1898: character maps to <undefined>
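
The last frame of the traceback is in `cp1252.py`, so the failure happens while `print` encodes the text for the output stream, not while the file is read. A minimal sketch of that failure follows. The two arrow characters are purely hypothetical stand-ins, since the post does not show what sits at positions 1897-1898.

```python
# Hypothetical sample: the two arrows stand in for whatever characters the
# OCR text really contains at positions 1897-1898 (not shown in the post).
sample = "ENR \u25b2\u25bc EWW"

try:
    # Roughly what print() has to do when the output stream uses cp1252.
    sample.encode("cp1252")
    encodable = True
except UnicodeEncodeError as exc:
    encodable = False
    reason = exc.reason  # the text after the last colon in the traceback

# The same text encodes without error as UTF-8.
utf8_bytes = sample.encode("utf-8")

# An error handler trades fidelity for not crashing: unencodable
# characters become '?'.
lossy = sample.encode("cp1252", errors="replace").decode("cp1252")
print(encodable, repr(lossy))
```

If this reproduces the same `'charmap' codec can't encode` message, the problem is the output encoding rather than the file contents.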

The file is a plain-text file downloaded from Google Drive, where it underwent optical character recognition to extract the text from a .png file.

I have only been coding for a month or so, so I may be doing something daft.

with open('D:/MarketAppData/procScreencaps/2019_01_28_04_15_24.txt', 'r', encoding="UTF-8") as read_file1:
    stuff = read_file1.read()
stuff = stuff.split()

# Drop the first item if it is not alphanumeric (the BOM plus underscores)
if not stuff[0].isalnum():
    stuff.pop(0)
print(stuff)
  • We need more information. Please see https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors and then [edit] your post to update it. – tripleee Jan 30 '19 at 12:52
  • Specifically post the full error message *including the line that caused it*. Could it be the `print` line? – Serge Ballesta Jan 30 '19 at 12:54
  • Thanks for the feedback; I have added the complete error message. It does appear to be the print line, but it works fine in Jupyter notebooks? – Lucullus Jan 30 '19 at 13:05
  • What's in the file at the indicated position; is it actually valid UTF-8? Can you obtain a hex dump of a few bytes around the problematic spot? – tripleee Jan 30 '19 at 13:10
  • What does "gets rid of the UTF-8 tag" mean? – tripleee Jan 30 '19 at 13:11
  • Well, the read() creates a string. I split the string up. The first item is always this: '\ufeff________________', which is not info I am interested in. If I print the list before removing this item, it fails in Eclipse but works in Jupyter. So the removal of the item has no effect. – Lucullus Jan 30 '19 at 13:18
  • So it removes the BOM? Just the BOM, or also a sequence of underscores? – tripleee Jan 30 '19 at 13:19
  • the first 10 or so characters output are ['ENR', 'ENRPA', 'EWW', 'PLD', 'REXR', 'O', 'FDN', 'DRE', 'EWW', 'VNQI', 'EWW', 'SIRI', 'c', 'DVN', 'ERIC', 'INTC', 'PBR', 'SIRI', 'FHN', 'UAL', 'ITUB', 'SYY', 'PFE', 'PYPL', 'MU', 'KMI', 'BHGE – Lucullus Jan 30 '19 at 13:22
  • The error message indicates that your system uses Windows code page 1252. Presumably the Unicode string contains characters which this codec cannot represent. Whether you want to fix the system encoding to be able to accommodate Unicode, or discard or replace characters which didn't exist in 1991 is really up to you. – tripleee Jan 30 '19 at 13:24
  • Yeah, it removes the \ufeff and a sequence of underscores; the split statement makes these the first item in the list. I just pop them. But regardless of that, the code works fine in Jupyter and not when I run it in Eclipse with Python 3.6. – Lucullus Jan 30 '19 at 13:25
  • So when I run the code in Jupyter, it doesn't use this Windows code page 1252, I assume? – Lucullus Jan 30 '19 at 13:27
  • Your Jupyter clearly is able to handle Unicode output. If that's what you want on the console as well, configure Windows to do the same thing somehow. – tripleee Jan 30 '19 at 13:28
  • Thanks, you have been very helpful. – Lucullus Jan 30 '19 at 13:29
  • Please consider accepting the duplicate nomination if this got you onto the right track. – tripleee Jan 30 '19 at 13:31
  • It turns out you can change the encoding scheme inside Eclipse itself. The following link shows you how. https://z0ltan.wordpress.com/2011/12/25/changing-the-encoding-in-eclipse-to-utf-8-howto/ – Lucullus Jan 30 '19 at 14:23
  • Sorry if I didn't mark it as a duplicate, @triplee. I'm not sure how to, and maybe the issue is more easily sorted in Eclipse as shown above. – Lucullus Jan 30 '19 at 14:25
  • If reading UTF-8 files with BOM, use `utf-8-sig` to read the file and automatically detect and remove the BOM. – Mark Tolonen Jan 31 '19 at 05:57
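
The `utf-8-sig` suggestion can be tried with a small sketch like the one below. The temporary file and its contents are hypothetical stand-ins for the OCR output, not the asker's data. Note that `utf-8-sig` only strips the BOM: the run of underscores is still the first item, so a check like the `isalnum()` test in the question is still needed if that item is unwanted.

```python
import os
import tempfile

# Hypothetical stand-in for the OCR file: writing with "utf-8-sig"
# puts a BOM at the start, as the downloaded file appears to have.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8-sig") as f:
    f.write("________ ENR EWW")

# Reading as plain UTF-8 keeps the BOM as the character '\ufeff'.
with open(path, "r", encoding="utf-8") as f:
    plain = f.read().split()

# Reading as "utf-8-sig" detects the BOM and drops it.
with open(path, "r", encoding="utf-8-sig") as f:
    stripped = f.read().split()

# ascii() escapes every non-ASCII character, so this print is safe
# even when the output stream is limited to cp1252.
print(ascii(plain[0]), ascii(stripped[0]))
```

For the printing side, `ascii()` is only a debugging aid. To see the real characters, the output stream itself has to accept them, for example by changing the Eclipse console encoding as in the link above, or by setting the `PYTHONIOENCODING` environment variable to `utf-8` before the interpreter starts.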

0 Answers