26

What is the difference between UTF-32 and UCS-4? Isn't UTF-32 supposed to be a fixed-width encoding?

Virus721
  • 1
    What is it about [the wikipedia page](https://en.wikipedia.org/wiki/UTF-32) that is unclear? If there are ambiguities on that page, it would be useful to discuss them. – Norman Gray May 12 '15 at 09:29
  • What 'hate'? The question is completely answered by the Wikipedia page, so it's not a useful addition to this site. If there's something on that page that isn't clear (and much about Unicode is perplexing), then a more detailed question – which says for example 'This explanation seems to imply X, but this other part implies Y, which contradicts; so what's the resolution?' – would be a useful and instructive question. A question which doesn't display research, or other attempts by the questioner to answer it themself, is ... less so. – Norman Gray May 12 '15 at 12:29

2 Answers

20

The Unicode Standard Version 8.0, Appendix C states:

UCS-4 stands for “Universal Character Set coded in 4 octets.” It is now treated simply as a synonym for UTF-32, and is considered the canonical form for representation of characters in ISO 10646 (Universal Coded Character Set).
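On the fixed-width part of the question, a minimal Python sketch can illustrate the idea: every code point, inside or outside the BMP, takes exactly one 4-byte code unit. The helper name `utf32_units` and the sample string are made up for illustration; `utf-32-be` is Python's built-in big-endian codec, which does not prepend a BOM.

```python
def utf32_units(text):
    """Return the 32-bit code units of text encoded as big-endian UTF-32."""
    data = text.encode("utf-32-be")  # no BOM with an explicit byte order
    return [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]


# Hypothetical sample: ASCII, Latin-1, BMP, and supplementary-plane characters.
sample = "a\u00e9\u20ac\U0001F600"

units = utf32_units(sample)

# Fixed width: 4 bytes per code point, whatever the plane.
assert len(sample.encode("utf-32-be")) == 4 * len(sample)

# Each code unit is simply the code point's numeric value.
assert units == [ord(ch) for ch in sample]
```

Because each code unit equals the code point value, no decoding step beyond reading the integer is needed, which is what makes the form convenient as a canonical representation.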

Jim U
Jonathan Maddox
15

UTF-32 started as a subset of UCS-4. Now the two are identical, except that the UTF-32 standard has additional Unicode semantics. See the details on Wikipedia:

The original ISO 10646 standard defines a 31-bit encoding form called UCS-4, in which each encoded character in the Universal Character Set (UCS) is represented by a 32-bit friendly code value in the code space of integers between 0 and hexadecimal 7FFFFFFF.

Because only 17 planes are actually in use, all current code points are between 0 and 0x10FFFF. UTF-32 is a subset of UCS-4 that uses only this range. Since the Principles and Procedures document of JTC1/SC2/WG2 states that all future assignments of characters will be constrained to the BMP or the first 14 supplementary planes, UTF-32 will be able to represent all Unicode characters. Accordingly, UCS-4 and UTF-32 are now identical except that the UTF-32 standard has additional Unicode semantics.

However, I am not exactly sure what "additional Unicode semantics" means. Maybe someone can provide a better answer.
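The range difference described in the quote can be sketched as two validity checks. This is only an illustration: the function and constant names are hypothetical, the UCS-4 check assumes nothing beyond the original 31-bit code space from the quote, and the surrogate exclusion for UTF-32 comes from the Unicode definition of scalar values rather than from the quoted text.

```python
UCS4_MAX = 0x7FFFFFFF    # original ISO 10646 31-bit code space
UTF32_MAX = 0x10FFFF     # last code point of plane 16
SURROGATES = range(0xD800, 0xE000)  # reserved for UTF-16, not scalar values


def is_valid_ucs4(value):
    """Assumption: original UCS-4 accepts any value in the 31-bit code space."""
    return 0 <= value <= UCS4_MAX


def is_valid_utf32(value):
    """UTF-32 accepts only Unicode scalar values: 0..0x10FFFF minus surrogates."""
    return 0 <= value <= UTF32_MAX and value not in SURROGATES


# 0x110000 fits in the old 31-bit space but lies beyond the 17 planes.
assert is_valid_ucs4(0x110000)
assert not is_valid_utf32(0x110000)
```

Every value accepted by `is_valid_utf32` is also accepted by `is_valid_ucs4`, which is the "subset" relationship in code form.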

BenMorel
Christian Gollhardt
  • I personally don't know, @一二三. Maybe we need a better answer with more information about this. – Christian Gollhardt Apr 20 '16 at 02:48
  • 1
    The Wikipedia article says "[clarification needed]". – Keith Thompson Apr 20 '16 at 02:54
  • 4
    Sounds to me like UCS-4 = [0,0x7FFFFFFF] while UTF-32 = [0,0x10FFFF]. Both are represented as 32 bits, but UTF-32 further restricts the range of legal values. – Bill Fraser Oct 28 '16 at 23:13
  • 1
    UTF contains additional properties such as right-to-left, etc.; see https://en.wikipedia.org/wiki/Unicode_character_property for details. Otherwise the two are the same. – Ian Apr 23 '19 at 06:37
  • See http://www.unicode.org/faq/utf_bom.html#utf32-1: “UTF-32 is a subset of the encoding mechanism called UCS-4 in ISO 10646.” – hermannk Oct 04 '20 at 09:50