
How many bytes does '\x80' occupy in UTF-8?

In Python I write:

>>> '\x80'.encode('utf8')
b'\xc2\x80'

This indicates that '\x80' translates to two bytes.

Also, the other way around:

>>> b'\x80'.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Does this mean the byte '\x80' on its own has no (character) meaning in UTF-8?

halloleo
  • Note: UTF-8 is an encoding and is independent of Python. A byte string is a sequence of bytes, so if you define just one byte, it takes one byte (plus Python's per-object overhead, e.g. type information and length). Strings in Python have their own internal representation (which depends on the Python version), but you should probably ignore that implementation detail. `len` on a string gives the number of Unicode code points, not the number of bytes. – Giacomo Catenazzi Jan 08 '20 at 09:10

1 Answer


The Unicode character U+0080 (a control character: PAD) is encoded in UTF-8 as two bytes, 0xC2 and 0x80.
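As a rough illustration of where those two bytes come from, the sketch below packs the code point into the two-byte UTF-8 layout (110xxxxx 10xxxxxx) by hand and compares the result with Python's built-in encoder. The variable names are only illustrative.

```python
# Two-byte UTF-8 layout: 110xxxxx 10xxxxxx (11 payload bits in total).
code_point = 0x80

lead = 0b11000000 | (code_point >> 6)          # upper 5 payload bits
cont = 0b10000000 | (code_point & 0b00111111)  # lower 6 payload bits

manual = bytes([lead, cont])
builtin = chr(code_point).encode("utf-8")

print(manual.hex(), builtin.hex())  # c280 c280
```

The same layout only applies to code points from U+0080 to U+07FF; larger code points need three or four bytes.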

A byte stream containing only the byte 0x80 is not a valid UTF-8 encoding of anything (i.e. that byte alone is a malformed UTF-8 stream).

Basically each byte in a UTF-8 stream can be classified as one of three different types:

  • single-byte sequences: bytes in the range 0x00-0x7F (0-127) represent a single Unicode code point on their own (this part is equivalent to the old US-ASCII encoding)
  • leading bytes: bytes in the range 0xC0-0xFD (192-253) start a multi-byte sequence and indicate how long that sequence must be *
  • continuation bytes: bytes in the range 0x80-0xBF (128-191) are the rest of a multi-byte sequence.

0x80 is a continuation byte, so it can't stand on its own (to be valid, it has to be preceded by a leading byte and possibly some other continuation bytes).
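The three-way classification can be sketched as a small helper. The function name `classify` and the labels it returns are made up for this example, and the ranges are the ones listed above (including the historical 0xC0-0xFD range for leading bytes), not a full validity check.

```python
def classify(byte: int) -> str:
    """Classify one byte value (0-255) by its role in a UTF-8 stream."""
    if byte <= 0x7F:
        return "single"        # stands alone, same as US-ASCII
    if byte <= 0xBF:
        return "continuation"  # 0x80-0xBF, must follow a leading byte
    if byte <= 0xFD:
        return "leading"       # 0xC0-0xFD, starts a multi-byte sequence
    return "unused"            # 0xFE and 0xFF

# Iterating over a bytes object yields integers.
for b in b"\xc2\x80":
    print(hex(b), classify(b))
```

For the two bytes of the encoded U+0080 this reports 0xc2 as a leading byte and 0x80 as a continuation byte, which is why 0x80 alone cannot be decoded.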

The Wikipedia article on UTF-8 has some very extensive documentation with good examples.

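* Note that some leading bytes can never appear in valid UTF-8: 0xC0 and 0xC1 could only start overlong encodings of ASCII characters, and 0xF5-0xFD would encode code points beyond U+10FFFF, the upper limit of Unicode. The leading bytes actually used are therefore 0xC2-0xF4. Similarly, 0xFE and 0xFF are never used and so can never appear in valid UTF-8.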
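One way to check which leading bytes a decoder really accepts is to brute-force it. The sketch below assumes CPython 3's strict UTF-8 decoder, and the helper name `can_start_valid_sequence` is hypothetical: it tries every possible second byte, padded out to total lengths of 2, 3 and 4 bytes, and records whether any candidate decodes.

```python
def can_start_valid_sequence(lead: int) -> bool:
    """Return True if some valid UTF-8 sequence begins with this byte."""
    for second in range(0x80, 0xC0):      # every possible continuation byte
        for extra in range(3):            # total length 2, 3 or 4 bytes
            candidate = bytes([lead, second]) + b"\x80" * extra
            try:
                candidate.decode("utf-8")
                return True
            except UnicodeDecodeError:
                pass
    return False

usable = [b for b in range(0xC0, 0x100) if can_start_valid_sequence(b)]
print(hex(usable[0]), hex(usable[-1]), len(usable))
```

With a decoder that follows the current UTF-8 definition, the usable leading bytes should come out as the contiguous range 0xC2-0xF4.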

Joachim Sauer