-2

UTF-32 has its last bits zeroed. As I understand it UTF-16 doesn't use all its bits either.

Is there a 16-bit encoding that has all bit combinations mapped to some value, preferably a subset of UTF, like ASCII for 7-bit?

phuclv
J Alan
    So, you are asking for any character encoding (for any character set) that has 16-bit code units and uses all values 0 to 65535 as valid code units? Why? – Tom Blodget Nov 06 '18 at 17:08

1 Answer

2

UTF-32 has its last bits zeroed

This is not quite correct, depending on how you count. Typically we count from the left, so it's the high (i.e. first) bits of UTF-32 that are zero.
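A quick sketch in Python (not from the original answer) showing that the high bytes of a UTF-32 code unit are zero for ordinary characters:

```python
# Encode a BMP character as big-endian UTF-32: the leading (high) bytes
# of the 32-bit code unit are zero, since code points only go up to U+10FFFF.
encoded = "A".encode("utf-32-be")
print(encoded)  # b'\x00\x00\x00A'
```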

As I understand it UTF-16 doesn't use all its bits either

That's not correct either. UTF-16 uses all of its bits. It's just that the range [0xD800–0xDFFF] is reserved for UTF-16 surrogate pairs, so those values will never be assigned to any character and will never appear in UTF-32. If you need to encode characters outside the BMP with UTF-16, then those values will be used.
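For example, here is a small Python sketch (my addition, not part of the original answer) showing a surrogate pair being used for a character outside the BMP:

```python
# U+1F600 (😀) lies outside the BMP, so UTF-16 must use a surrogate pair:
# high surrogate 0xD83D followed by low surrogate 0xDE00.
emoji = "\U0001F600"
utf16 = emoji.encode("utf-16-be")
print(utf16.hex())  # d83dde00
```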

In fact Unicode was limited to U+10FFFF just because of UTF-16, even though UTF-8 and UTF-32 themselves are able to represent up to U+7FFFFFFF and U+FFFFFFFF respectively. The use of surrogate pairs makes it impossible to encode values larger than 0x10FFFF in UTF-16.
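The 0x10FFFF limit follows directly from the surrogate arithmetic; a quick check (my sketch, not from the answer):

```python
# Each surrogate carries 10 payload bits, and the pair encodes
# 0x10000 + ((high_bits << 10) | low_bits).
# With both payloads maxed out at 0x3FF, the largest reachable
# code point is exactly U+10FFFF.
max_code_point = 0x10000 + ((0x3FF << 10) | 0x3FF)
print(hex(max_code_point))  # 0x10ffff
```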

See Why Unicode is restricted to 0x10FFFF?

Is there a 16-bit encoding that has all bit combinations mapped to some value, preferably a subset of UTF, like ASCII for 7-bit?

First, there's no such thing as "a subset of UTF", since UTF isn't a character set but a way to encode Unicode code points.

Prior to the existence of UTF-16, Unicode was a fixed 16-bit character set encoded with UCS-2. So UCS-2 might be the closest you'll get; it encodes only the characters in the BMP. Other fixed-width 16-bit non-Unicode charsets also have encodings that map all of the bit combinations to some characters.
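For BMP characters, UCS-2 is simply the code point stored as one 16-bit unit, which coincides with the UTF-16 encoding. A small Python sketch of that coincidence (my addition, assuming big-endian byte order):

```python
import struct

# For a BMP character, UCS-2 is just the code point packed into a
# single 16-bit code unit — identical to its UTF-16 encoding.
euro = "\u20AC"  # U+20AC EURO SIGN
ucs2 = struct.pack(">H", ord(euro))
print(ucs2 == euro.encode("utf-16-be"))  # True
```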

However, why would you want that? UCS-2 was deprecated long ago. Some old tools and less experienced programmers still assume that Unicode characters are always 16 bits long, which is incorrect and will break modern text processing.

Also note that not all values below 0xFFFF are assigned to characters (and the surrogate range is permanently excluded), so no encoding can map every 16-bit value to an assigned Unicode character
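You can see the surrogate-range exclusion directly: Python's UTF-16 codec refuses to encode a lone surrogate (a sketch I've added for illustration):

```python
# Code points in U+D800–U+DFFF are reserved for surrogates and are not
# characters; the UTF-16 codec rejects them when they appear alone.
try:
    "\ud800".encode("utf-16-be")
except UnicodeEncodeError:
    print("refused to encode a lone surrogate")
```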

Further reading

phuclv