So here I am: I have been reading about encodings all day, and now I need some clarification.

First off, I'm using Eclipse Mars with PyDev.

Unicode is a (character set + code points), basically a table of symbols associated with numerical values. How those values are stored at the binary level is defined by the encoding, let's say UTF-8.

1 : shebang

What is the shebang for? When I put # -*- coding: utf-8 -*-, does it do something? Or does it just indicate that my file is encoded in UTF-8 (and since it's just an indication, it could be a lie :o)?

2 : Eclipse file encoding

After I wrote my shebang and saved, I went into the properties of the file, and they said encoding: ISO-8859-1. So my guess is that the shebang does nothing besides indicate which encoding my file is in. Do I need to manually set every file to UTF-8, or is there a way to teach Eclipse to read the shebang and act accordingly?

3 : Why does the shebang only specify the encoding?

My shebang says utf-8, ok right, so what? It does not tell me which character set is used. Since UTF-8 is just an encoding, I could use UTF-8 with any character set, no? I could encode ASCII in UTF-8 if I wanted, since an encoding is just a way to convert and store/read code points. What if my character set encoded in UTF-8 does not have the same code points as Unicode? (Is this possible?)

4 : maybe a solution?

I often read that UTF-8 is an implementation of Unicode. Does that mean that each time you read encoding = UTF-8 you can be 100%, and I say 100%, sure that the character set + code points is Unicode?

I'm lost

Heetola
  • Adding a declaration line to the script is not the answer. The encoding between modules is undefined. Python automatically picks up the encoding of the current system, so your system has to use UTF-8: set UTF-8 as the encoding in your OS settings. Python reads the locale encoding each time it runs. If you want to work with UTF-8, Unicode won't do you any good. – dsgdfg Sep 15 '15 at 20:36
  • please, limit your questions to a single issue per question – jfs Sep 16 '15 at 13:20

3 Answers

There are multiple misconceptions in your question.

Unicode is a standard that is commonly used for working with text. It is more than "character set + code points": for example, the Unicode standard also defines how to find word boundaries and how to compare Unicode strings.

# -*- coding: utf-8 -*- is an encoding declaration. It is not a shebang. A shebang (as its name suggests) starts with #!, e.g., #! /usr/bin/env python.

You might need the encoding declaration if there are literal non-ASCII characters in your Python source code. For example, you don't need an encoding declaration if you write:

#!/usr/bin/env python2
print u"\N{SNOWMAN}"

But you need it if you use literal non-ASCII characters:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
print u"☃"

Both scripts produce the same output if the second script is saved using utf-8 encoding. The encoding declaration says how to interpret bytes that constitute the Python source code to get the program text.
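To see what "how to interpret bytes" means in practice, here is a minimal sketch. It is written for Python 3 (the scripts above are Python 2) and relies on `exec` accepting source code as bytes and honoring the encoding declaration. The helper name `run_source` is made up for this illustration.

```python
# The same three bytes (the UTF-8 encoding of the snowman) are placed in a
# string literal, then the script is read under two different declarations.
SNOWMAN_UTF8 = "\N{SNOWMAN}".encode("utf-8")  # bytes e2 98 83

def run_source(declared):
    """Execute a tiny script given as raw bytes and return its variable `s`."""
    declaration = ("# -*- coding: %s -*-\n" % declared).encode("ascii")
    source = declaration + b's = "' + SNOWMAN_UTF8 + b'"\n'
    namespace = {}
    exec(source, namespace)  # bytes source: decoded according to the declaration
    return namespace["s"]

honest = run_source("utf-8")      # declaration matches how the bytes were saved
lying = run_source("iso-8859-1")  # declaration does not match the bytes

print(honest == "\N{SNOWMAN}")  # True
print(len(honest), len(lying))  # 1 3
```

With the mismatched declaration nothing fails: each of the three bytes is silently read as its own character, which is why the declaration and the encoding the editor actually saves with have to agree.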

"is there a way to teach eclipse to read the shebang encoding declaration and act accordingly." is a good separate question. If IDE has explicit Python support then it should do it automatically.

"My encoding declaration says utf-8, ok right, so what? It does not tell me which character set is used."

"character encoding", codepage, and charset may be used interchangeably in many contexts. See What's the difference between encoding and charset? The distinctions are irrelevant for the task of converting from bytes to text and back in Python:

unicode_text = bytestring.decode(character_encoding)
bytestring = unicode_text.encode(character_encoding)

A bytestring is an immutable sequence of bytes in Python (roughly speaking, numbers in the 0..255 range) that is used to represent arbitrary binary data, e.g., images, zip archives, encrypted data, and text encoded using some character encoding. A Unicode string is an immutable sequence of Unicode codepoints (roughly speaking, numbers in the 0..sys.maxunicode range) that is used to represent text in Python.
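A small round trip makes the two types concrete. This sketch uses Python 3, where the Unicode string type is called `str` and the bytestring type is called `bytes` (in Python 2 they are `unicode` and `str`). The sample text is arbitrary.

```python
unicode_text = "caf\N{LATIN SMALL LETTER E WITH ACUTE}"  # 4 code points
bytestring = unicode_text.encode("utf-8")                 # text -> bytes

print(type(unicode_text).__name__, len(unicode_text))  # str 4
print(type(bytestring).__name__, len(bytestring))      # bytes 5
print([ord(ch) for ch in unicode_text])                # [99, 97, 102, 233]
print(list(bytestring))                                # [99, 97, 102, 195, 169]
print(bytestring.decode("utf-8") == unicode_text)      # True
```

The last character is a single code point (233) but takes two bytes (195, 169) in UTF-8, so the lengths of the two objects differ.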

Some character encodings, such as cp437, support only a small subset of Unicode characters. Others, such as utf-8, support the full range of Unicode codepoints.
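The difference shows up as soon as you try to encode a character that the target encoding has no slot for. A minimal Python 3 sketch follows; the helper `can_encode` is made up for the example.

```python
def can_encode(text, encoding):
    """Return True if every character of `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

snowman = "\N{SNOWMAN}"
e_acute = "\N{LATIN SMALL LETTER E WITH ACUTE}"

print(can_encode(e_acute, "cp437"))  # True
print(can_encode(snowman, "cp437"))  # False
print(can_encode(snowman, "utf-8"))  # True
```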

jfs

The right way to add the encoding declaration is:

# -*- coding: utf-8 -*-

It tells Python to read the source of the current script as UTF-8; it has nothing to do with the user.

J.Dev
  • That's what I meant, I wrote my question too fast (I will edit). So the Python interpreter will read the script as utf-8, so I DO HAVE to set utf-8 in my IDE properties, otherwise Python will try to read utf-8 from a file encoded in something else – Heetola Sep 15 '15 at 20:09

OK, I think I found an answer to all those questions.

1/ Thanks to J.Dev: the encoding declaration only tells the Python interpreter which encoding the file is in, but YOU have to save the file in the encoding you put in the declaration.

2/ Apparently I have to do it manually

3/ Because an encoding is associated with a charset: if you say encoding=utf-8, then the charset will always be Unicode.

Some old 1-byte charsets don't have a separate encoding. You don't need one, since each character is stored in 1 byte: the natural binary representation of its code is the encoding.

So when you say ASCII, for instance, you mean both the charset and the encoding.
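A quick way to check point 3/ is a Python 3 sketch like the one below, with an arbitrary sample character. The UTF encodings produce different bytes, but decoding always gives back the same Unicode code point.

```python
snowman = "\N{SNOWMAN}"  # code point U+2603

for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    data = snowman.encode(encoding)
    # Different bytes each time, same code point after decoding.
    print(encoding, data.hex(), hex(ord(data.decode(encoding))))
# utf-8 e29883 0x2603
# utf-16-le 0326 0x2603
# utf-32-le 03260000 0x2603

# For pure ASCII text, the ASCII and UTF-8 byte sequences are identical.
print("abc".encode("ascii") == "abc".encode("utf-8"))  # True
```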

But this leaves me wondering: are there other charsets out there with multiple encoding implementations (like Unicode can be encoded in UTF-8/16/32)?
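One example that seems to fit, offered as an illustration and not as something established earlier in this thread: the Japanese character set JIS X 0208 is commonly stored as Shift_JIS, EUC-JP, or ISO-2022-JP. Python ships codecs for all three, so a Python 3 sketch can compare them; the sample text is arbitrary.

```python
text = "\u65e5\u672c\u8a9e"  # the word "Japanese" written in kanji

# Encode the same characters with three different encodings.
encoded = {name: text.encode(name) for name in ("shift_jis", "euc_jp", "iso2022_jp")}

for name, data in encoded.items():
    print(name, data.hex())

print(len(set(encoded.values())))  # 3
print(all(data.decode(name) == text for name, data in encoded.items()))  # True
```

The three byte sequences differ, yet each decodes back to the same text.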

Heetola