Unicode Transformation Format (8/16/32/...): encodings used to serialize Unicode code points
Unicode defines abstract CodePoints and their interactions. It also defines multiple encodings for storing and exchanging those CodePoints. All of them can express every valid Unicode CodePoint, but they differ in size, compatibility, ability to represent invalid data, and efficiency.
- utf-8 (people sometimes write just UTF for this encoding) can encode all valid sequences as well as the invalid sequences of the other encodings, and is an ASCII superset. Unless there is a compelling compatibility constraint, this encoding is preferred.
- punycode is used only for internationalized domain names (historical contenders were utf-5 and utf-6).
- GB18030 is the official Chinese standard encoding; it covers all Unicode CodePoints.
- UTF-EBCDIC was meant to fill the role of utf-8 on EBCDIC systems but never caught on.
- utf-7 was designed for channels that are not 8-bit clean, such as old email transports, but never gained much popularity even there.
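A minimal sketch of the points above, using Python's standard codecs (the sample strings are arbitrary illustrations):

```python
# Sketch: comparing the encodings above with Python's stdlib codecs.
text = "héllo"

# utf-8 is an ASCII superset: pure-ASCII text encodes to the same bytes.
assert "hello".encode("utf-8") == "hello".encode("ascii") == b"hello"

# Non-ASCII CodePoints take multiple bytes in utf-8:
assert text.encode("utf-8") == b"h\xc3\xa9llo"

# GB18030 also round-trips arbitrary Unicode CodePoints:
assert text.encode("gb18030").decode("gb18030") == text

# punycode in its IDNA framing, as used for international domain names:
assert "bücher".encode("idna") == b"xn--bcher-kva"
```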
The following encodings come in three variants: big-endian, little-endian, and any-endian with a BOM (byte order mark).
- utf-16 (utf-16le) Early adopters who embraced ucs2, back when people thought 65536 CodePoints would be enough, moved to this encoding. Apart from orphaned surrogates, invalid utf-8 or utf-32 sequences cannot be represented in utf-16. It is also rarely more space-efficient than utf-8, nor is it fixed-width (not even utf-32 really is).
- utf-32 (identical to ucs4, aka modern ucs) This is the one-CodeUnit-per-CodePoint encoding. Because combining CodePoints negate this already questionable benefit, and because of its huge storage demand, it is seldom used even for internal representation.
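The endianness/BOM variants, surrogate pairs, and the fixed-width caveat can be demonstrated the same way (the sample CodePoints are chosen for illustration):

```python
# Sketch: utf-16/utf-32 variants and why "fixed width" is misleading.
s = "\U0001D11E"  # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

# In utf-16 this one CodePoint needs a surrogate pair (two 16-bit CodeUnits):
assert s.encode("utf-16-le") == b"\x34\xd8\x1e\xdd"
assert len(s.encode("utf-16-le")) == 4

# The plain "utf-16" codec is the any-endian variant: it prepends a BOM.
bom_variant = s.encode("utf-16")
assert bom_variant.startswith(b"\xff\xfe") or bom_variant.startswith(b"\xfe\xff")

# utf-32 really is one CodeUnit per CodePoint, but combining CodePoints mean
# one user-perceived character can still span several CodePoints:
e_acute = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
assert len(e_acute) == 2                       # two CodePoints
assert len(e_acute.encode("utf-32-le")) == 8   # two 4-byte CodeUnits
```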