I'm working with Visual Studio Code under Lubuntu 18.04. The file encoding in VS Code is configured to be UTF-8, and the Python scripts have the encoding set to utf-8:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

The Python files contain some non-ASCII characters, as in this example docstring:

"""
'Al final pudimos reparar el problema de registro de datos y se pudieron montar los
equipos para recoger algún dato más. ...'
"""

When I execute the scripts, I get the following error:

SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xfa in position 141: invalid start byte

Here is the traceback:

Traceback (most recent call last):
  File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/USERNAME/.vscode/extensions/ms-python.python-2020.6.90262/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/home/USERNAME/.vscode/extensions/ms-python.python-2020.6.90262/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/home/USERNAME/.vscode/extensions/ms-python.python-2020.6.90262/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 267, in run_file
    runpy.run_path(options.target, run_name=compat.force_str("__main__"))
  File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/runpy.py", line 261, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/runpy.py", line 236, in _get_code_from_file
    code = compile(f.read(), fname, 'exec')
  File "/home/USERNAME/Desktop/Python/Scripts/General/Import_export/import_EXCEL_spreadsheet_data_write_to_CSV.py", line 338

None of the numerous proposals worked for me: the error is thrown when executing any Python script that contains non-ASCII characters, even in comments or docstrings.

Andreas L.
  • It appears that your file isn't actually saved as UTF-8. Either save it in that encoding, or change the encoding comment to match the file. – jasonharper Jun 26 '20 at 13:10
  • How do I save a python script as UTF-8 automatically when creating it the first time? Normally, I work with VS Code, open a new window, copy my standard code header from another python file into it to start with, save it and that's it. – Andreas L. Jun 26 '20 at 13:50
  • I just tried something out, which indeed revealed a change in the encoding of the saved file: the command `file -i filename` gave me `text/x-python; charset=utf-8`, but when I changed one character like `a` to `á` and saved the file again, the output was `text/x-python; charset=iso-8859-1`. Does this mean I have to put `# -*- coding: iso-8859-1 -*-` instead of `# -*- coding: utf-8 -*-` at the beginning of each `python` file? – Andreas L. Jun 26 '20 at 13:54
  • Settings - Text Editor - Files - Encoding -> utf8 – MrBean Bremen Jun 26 '20 at 13:58
  • Thanks MrBean Bremen, I've just checked this option in VS Code and it was already put to utf8. I don't understand why it changes automatically the charset to "iso-8859-1" upon saving a textfile containing a non-ASCII character such as an "á". – Andreas L. Jun 26 '20 at 14:22
  • A file with only ASCII characters is valid UTF-8. ASCII is a subset of UTF-8. Without a non-ASCII character, the `file` command couldn't tell that it was not UTF-8. `á` encoded in UTF-8 is `b'\xc3\xa1'` but in ISO-8859-1 it is `b'\xe1'`. The `#coding:` line must match the actual encoding. – Mark Tolonen Jun 26 '20 at 15:50
  • This sounds like a bug in VSCode to me. I just tested this in VSCode under Windows (created a new Python file, added some non-ASCII text, and saved), and it was saved correctly in UTF-8, even without the coding line (the file encoding is set to utf8, same as in your case) - and we know that UTF-8 is not the default encoding in Windows. You did save the file in VSCode, right? – MrBean Bremen Jun 26 '20 at 17:22
  • @MrBeanBremen exactly, I created and saved it in VS Code (using Linux Lubuntu 18.04 LTS). Technically, I do everything in VS Code. – Andreas L. Jun 26 '20 at 18:15
  • Well, just out of interest I just installed VSCode under Ubuntu 18.04, installed the Python extension, verified that file encoding is set to utf8, created a Python file with some umlauts and saved it. It has been correctly saved as utf-8. So, apart from using Ubuntu instead of Lubuntu this looks the same as your setup. There is also the possibility to set the encoding per language (or so it says), but I doubt that you have changed that setting. No idea... – MrBean Bremen Jun 27 '20 at 14:24
  • Thanks for your effort, I've just posted an interim answer to this issue describing my personal workaround that prevents the error from occurring. Until someone finds a better solution to the initial cause of VS Code acting apparently a bit buggy, this will be the answer. – Andreas L. Jun 28 '20 at 09:47
  • Does the file you copy/paste from have a Latin-1 encoding? – tripleee Jun 28 '20 at 11:47
  • No, it's a "utf8"-file and when I open this python-file in `VS Code`, then insert an "ä" or "á" somewhere within a comment or string, save it, it ends up being a `iso-8859-1`, which is equivalent to `latin-1`. This happens even though my file header states "utf-8" and the settings in VS Code are also set to "utf8". That's why the other already existent answers, such as https://stackoverflow.com/questions/10589620/syntaxerror-non-ascii-character-xa3-in-file-when-function-returns-%c2%a3 , don't help me as I've already implemented everything in the manner how it should be working, but it doesn't. – Andreas L. Jun 28 '20 at 12:52
  • What does the editor say in the bottom right corner of VS Code is the file's encoding? And do you have an overriding setting for Python specifically (this setting can be specified per-language)? – Brett Cannon Jun 30 '20 at 00:25
  • Thanks @BrettCannon for giving me the crucial hint. By seeing the wrong encoding on the bottom right corner, I figured the rest out myself, which I delineated in the answer I've just posted. – Andreas L. Jun 30 '20 at 07:53

2 Answers

0

EDIT after finding the cause of the issue

The actual answer can be found in my other answer. Nevertheless, since some of the information could still be useful, I'm not going to delete the original workaround below or the related comments.

Original workaround

Until someone comes up with a more elegant and to-the-point approach, I'll share my workaround which resolved the problem for me:

  1. Identify which languages you actually need: In my case, I code in English, but sometimes I need to type something in German (umlauts like "äöü") or Spanish (accents like "á, ú, ..."), which requires certain non-ASCII Latin characters.
  2. Since utf-8 doesn't work on my system for unknown reasons (as outlined in the comments on my question), replace # -*- coding: utf-8 -*- with # -*- coding: latin-1 -*- in all Python scripts where it's needed, or even as the default.

There are two ways to accomplish step 2:

  • Search all files in the directory within VS Code via the magnifying-glass (Search) icon, or using a shortcut. It may be useful to exclude the **/.history/** folder pattern from the search.
  • Use sed on the command line for search-and-replace tasks.
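
A Python alternative to sed for step 2 might look like the sketch below. The helper name replace_declared_encoding and the regular expression are assumptions made for illustration and are not part of the original workaround; the pattern is loosely modeled on the declaration format described in PEP 263. Note that changing the declaration only helps if the bytes in the file really are Latin-1.

```python
import re

# Loosely follows the declaration pattern from PEP 263,
# e.g. "# -*- coding: utf-8 -*-".
CODING_RE = re.compile(r"^([ \t\f]*#.*?coding[:=][ \t]*)([-\w.]+)")


def replace_declared_encoding(source: str, new_encoding: str) -> str:
    """Return source with the encoding name in its declaration swapped.

    Only the first two lines are checked, because Python ignores
    declarations further down. Hypothetical helper for illustration.
    """
    lines = source.split("\n")
    for i, line in enumerate(lines[:2]):
        match = CODING_RE.match(line)
        if match:
            # Keep everything around the old encoding name untouched.
            lines[i] = match.group(1) + new_encoding + line[match.end():]
            break
    return "\n".join(lines)


header = "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n"
print(replace_declared_encoding(header, "latin-1"))
```

The function works on a string rather than on a file so that it can be tried out safely before being applied to real scripts.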

By and large, this made it work for me. I can now type "äöü" and "á, ú, ..." in my scripts and save them without getting any errors in VS Code.

Andreas L.
  • I'm abstaining from downvoting, but this is really poor workaround. You really need to switch to a system where UTF-8 works everywhere. – tripleee Jun 28 '20 at 10:02
  • I don't know how to prevent `VS Code` from acting buggy. The option "utf8" is put everywhere by default in the settings of `VS Code`, and nevertheless, upon saving, the file charset is switched to "iso-8859-1" automatically when adding Umlauts (äöü..) or diacritics (accents like á, ú, ..). I confirmed this unexpected behavior via CLI command `file -i filename`. – Andreas L. Jun 28 '20 at 10:21
  • Until this is resolved, the workaround is valid for me and people speaking/employing "most West European languages such as Albanian, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, German, Galician, Irish, Icelandic, Italian, Norwegian, Portuguese, Spanish, and Swedish" (quote from https://www.terena.org/activities/multiling/ml-docs/iso-8859.html#:~:text=Latin%201%20covers%20most%20West,Portuguese%2C%20Spanish%2C%20and%20Swedish. ). I consider this acceptable given that one wants to make things work until a more elegant and generic solution is provided by someone else. – Andreas L. Jun 28 '20 at 10:22
  • Yeah, if that works for you it is a feasible workaround. I would file a bug with VS Code anyway - this seems to be a specific problem under Lubuntu. – MrBean Bremen Jun 28 '20 at 18:29
  • I edited the question to clarify that it is not a duplicate. – MrBean Bremen Jun 28 '20 at 19:08
  • In fact, I've filed a bug 1 day ago on the GitHub-page of VS Code: https://github.com/microsoft/vscode/issues/101227 Thanks for editing, I hope @tripleee will acknowledge that and remove "duplicate" from my question while reopening it. – Andreas L. Jun 29 '20 at 09:10
  • Sure, I cast the final reopen vote, though you were close to getting it reopened without my help (3 reopen votes is enough). Reopening automatically removes the duplicate marking. – tripleee Jun 29 '20 at 10:17
0

Finally, I found the cause of the entire problem: it lay in the settings.json file, where files.autoGuessEncoding was set to true:

"files.autoGuessEncoding": true

This option is able to override "files.encoding": "utf8", so even if you have defined a preferred encoding, VS Code may guess a different one. Thanks to the valuable hint from Brett Cannon, I noticed that the encoding shown in the bottom right corner of VS Code was indeed sometimes (automatically) set to Windows 1252. This unfortunate guess, caused by "files.autoGuessEncoding": true, led to the following errors (whenever I inserted umlauts ("äöü..") or diacritics ("éúá..") somewhere in my script):

  1. Getting the error message in pylint right after insertion: "error while code parsing: Wrong or no encoding specified for script.py."
  2. Next, running the script produces the mentioned SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xfa in position 141: invalid start byte
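
The decoding failure behind the second error can be reproduced in isolation. This is a minimal sketch, assuming the offending byte came from a character such as the "ú" in "algún" (the word appears in the docstring from the question); the helper name utf8_decode_error is hypothetical. In Windows 1252, as in ISO-8859-1, "ú" is stored as the single byte 0xfa, which is not a valid UTF-8 start byte.

```python
def utf8_decode_error(raw: bytes):
    """Return the UnicodeDecodeError raised when decoding raw as UTF-8, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


word = "algún"
as_cp1252 = word.encode("cp1252")  # what VS Code wrote after the wrong guess
as_utf8 = word.encode("utf-8")     # what the coding declaration promises

print(as_cp1252)  # b'alg\xfan'
print(as_utf8)    # b'alg\xc3\xban'

# Decoding the Windows 1252 bytes as UTF-8 fails on byte 0xfa.
print(utf8_decode_error(as_cp1252))
# The genuine UTF-8 bytes decode without error, so this prints None.
print(utf8_decode_error(as_utf8))
```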

As stated in the discussion related to the aforementioned link, VS Code is still somewhat inaccurate when it comes to detecting the correct file encoding, which I can confirm.

To resolve the problem, disable auto-detection by putting the following two lines in your settings.json (or set the associated options in the settings GUI of VS Code):

{
    ...,
    "files.encoding": "utf8",
    "files.autoGuessEncoding": false,
    ...
}

Now it is possible to place any desired character within the text file or script, such as umlauts ("äöü..") and diacritics ("éúá..").

Finally, note that the above-mentioned settings won't change the encoding of files that were created and saved earlier. To fix those, left-click on the encoding shown at the bottom right of the VS Code window, then either reopen or save the file with your desired encoding, which will most likely be utf8.
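
For many affected files, the same conversion could also be scripted outside of VS Code. The following is only a sketch under stated assumptions: the helper name reencode_to_utf8 is hypothetical, and the source encoding is assumed to be Windows 1252 (cp1252), which is what VS Code guessed in my case. Back up your files first, since a wrong assumption about the source encoding silently produces wrong characters.

```python
from pathlib import Path


def reencode_to_utf8(path: Path, source_encoding: str = "cp1252") -> bool:
    """Rewrite path as UTF-8; return True if the file was changed.

    Hypothetical helper: source_encoding is an assumption, not something
    that can be detected reliably from the bytes alone.
    """
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8 (pure ASCII included)
    except UnicodeDecodeError:
        pass
    # Raises UnicodeDecodeError if the bytes are not valid in source_encoding.
    text = raw.decode(source_encoding)
    # Writing bytes avoids any newline translation.
    path.write_bytes(text.encode("utf-8"))
    return True
```

A call such as reencode_to_utf8(Path("script.py")) leaves files that are already valid UTF-8 untouched, so running it twice on the same file is harmless.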

As an aside regarding the settings, note that you can also change them via the GUI under File -> Preferences -> Settings instead of editing the settings.json file (via Ctrl + Shift + P and then "Preferences: Open Settings (JSON)").

Andreas L.