
I keep getting an error and I'm not sure how to fix it.

The code:

if not len(lines) or lines[-1] == '' or lines[-1] == '▁':
    lines = list(filter(lambda line: False if line == '' or line == '▁' else True, list(lines)))

Output: SyntaxError: Non-ASCII character '\xe2' in file prepare_data.py on line 512, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

cs95
Sunny Suri
  • 1
    I know your question isn't about the code you've written but that needs some reworking too. – cs95 Jun 13 '18 at 04:53
  • Can you clarify your question and give more context for the code? Which of these two lines is line 512? You may have an encoding problem and need to declare UTF-8 in your source file. – Cyberguille Jun 13 '18 at 05:25
  • 1
    Also, what version of Python are you using? UTF-8 is the default encoding for source files starting in 3.0. (Thanks @IljaEverilä) – tripleee Jun 13 '18 at 06:09
  • The relevant PEP: https://www.python.org/dev/peps/pep-3120/, and a somewhat similar Q/A: https://stackoverflow.com/questions/6289474/working-with-utf-8-encoding-in-python-source – Ilja Everilä Jun 13 '18 at 06:14

2 Answers

2

The error message tells you exactly what's wrong. The Python interpreter needs to know the encoding of the bytes in the string which displays as a funky underscore.

If you want to match U+2581 then you can say

.... or lines[-1] == '\u2581':

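which represents this character in pure ASCII by way of a Unicode escape sequence. (In Python 2, which is where this error message comes from, the escape is only interpreted in a Unicode literal, so write u'\u2581'.) If you want to match a regular ASCII underscore, that's ASCII 95 / U+005F; here are the two characters side by side for easy comparison and possible copy/paste: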

U+2581 ▁  _ U+005F
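
To confirm that the two characters really are different, here is a minimal sketch for Python 3 (in Python 2 the literals would need a u prefix). The variable names are only for illustration.

```python
import unicodedata

block = '\u2581'      # written as a pure-ASCII escape sequence
underscore = '_'      # the regular underscore, ASCII 95 / U+005F

# Print the code point and the official Unicode name of each character.
for ch in (block, underscore):
    print('U+%04X %s' % (ord(ch), unicodedata.name(ch)))

# They look similar on screen but never compare equal.
print(block == underscore)
```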

The PEP linked in the error message explains exactly how to tell Python "this file is not pure ASCII; here's the encoding I'm using". The declaration is a comment on the first or second line of the file. If the encoding is UTF-8, that would be

# coding=utf-8

or the Emacs-compatible

# -*- coding: utf-8 -*-
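
Put together, the top of the file and the check from the question might look something like the sketch below. The helper name drop_blank_lines and the sample data are made up for illustration; the u prefix keeps the literals working on both Python 2 and Python 3.3+.

```python
# -*- coding: utf-8 -*-
# The declaration above has to be on the first or second line of the file.

BLOCK = u'\u2581'  # U+2581, written as an escape so the source stays pure ASCII


def drop_blank_lines(lines):
    """Return the lines that are neither empty nor a lone U+2581."""
    return [line for line in lines if line not in (u'', BLOCK)]


# Hypothetical sample data standing in for the real input.
lines = [u'first', u'', u'second', BLOCK]
if not lines or lines[-1] in (u'', BLOCK):
    lines = drop_blank_lines(lines)
print(lines)
```

With the escape in place the file contains no non-ASCII bytes at all, so the declaration is not strictly needed for this line; it only matters if the character is pasted in literally.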

If you don't know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow tag has a tag info page with more information and some troubleshooting tips.
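
If no hex editor is at hand, Python itself can show the raw bytes of the offending line. This is only a rough sketch; hex_of_line is a made-up helper name, not part of any library.

```python
import binascii


def hex_of_line(path, lineno):
    """Return the raw bytes of one line (1-based) as a hex string."""
    # Binary mode: the bytes are read as-is, no decoding is attempted.
    with open(path, 'rb') as handle:
        for number, raw in enumerate(handle, 1):
            if number == lineno:
                return binascii.hexlify(raw.rstrip(b'\r\n')).decode('ascii')
    raise ValueError('file has fewer than %d lines' % lineno)
```

Calling hex_of_line('prepare_data.py', 512) and finding e29681 in the result would point to UTF-8, since those are the three bytes UTF-8 uses for U+2581.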

In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can't and mustn't guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#e2 shows 21 possible interpretations for the byte 0xE2 and that's only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. In fact, I would guess you are actually using UTF-8, which represents this character as the three bytes 0xE2 0x96 0x81; but without also seeing the character rendered as something resembling an underscore, there would be absolutely no way to guess this for a human, either.
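
The ambiguity is easy to see from a Python 3 prompt. The following sketch decodes the same bytes with a few encodings; the three legacy codecs are an arbitrary selection, picked only to show that they disagree.

```python
raw = b'\xe2\x96\x81'

# Read as UTF-8, the three bytes together form the single character U+2581.
as_utf8 = raw.decode('utf-8')
print(len(as_utf8), hex(ord(as_utf8)))

# Read with legacy 8-bit encodings, the first byte alone means something
# different in each of them.
for codec in ('latin-1', 'cp437', 'koi8-r'):
    print(codec, repr(raw[:1].decode(codec)))
```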

tripleee
  • I adapted this into an answer on the question that I think this one should be closed as a duplicate of: https://stackoverflow.com/a/50831670/874188 – tripleee Jun 13 '18 at 07:44
0

Try this. I haven't tested it, but I think it might solve your encoding problem. Your code also needs some improvements for readability; please remember the Zen of Python.

def filter_line(line):
    if not line or line == '▁':
        return False
    else:
        return True

lines = [line.encode("utf-8") for line in lines]

if not lines or lines[-1] == '' or lines[-1] == '▁':
    lines = list(filter(filter_line, lines))
spikespaz
  • 1
    The encoding problem is in the code. You have copy/pasted the problematic character into your code, too. – tripleee Jun 13 '18 at 06:00