Unicode is a standard for encoding, representing and handling text. It aims to support every character needed for written text, covering all writing systems, technical symbols and punctuation.
Unicode
Unicode assigns each character a code point to act as a unique reference:
- U+0041 A
- U+0042 B
- U+0043 C
- ...
- U+039B Λ
- U+039C Μ
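As a rough illustration of the mapping above, the following Python sketch converts between characters and their code points using the built-in `ord` and `chr`. The helper names `code_point_label` and `from_code_point` are made up for this example and are not part of any standard.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    # ord() gives the code point as an integer; pad to at least 4 hex digits.
    return f"U+{ord(ch):04X}"


def from_code_point(cp: int) -> str:
    """Return the character assigned to the given code point."""
    return chr(cp)


if __name__ == "__main__":
    # The same characters as in the list above (the Greek letters are escaped
    # so they cannot be confused with the look-alike Latin letters).
    for ch in "ABC\u039B\u039C":
        print(code_point_label(ch), ch)
```

Note that the code point is just a number that identifies the character; it says nothing yet about how the character is stored as bytes.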
Unicode Transformation Formats
A Unicode Transformation Format (UTF) describes how to encode code points as bytes. The most common forms are UTF-8 (which encodes each code point as one, two, three or four bytes) and UTF-16 (which encodes each code point as two or four bytes).
Code Point   UTF-8   UTF-16 (big-endian)
U+0041       41      00 41
U+0042       42      00 42
U+0043       43      00 43
...
U+039B       CE 9B   03 9B
U+039C       CE 9C   03 9C
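The byte values in the table can be reproduced with Python's built-in codecs. This is only a sketch: `hex_bytes` and `encodings` are hypothetical helper names, while `"utf-8"` and `"utf-16-be"` are the standard Python codec names.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def encodings(ch: str):
    """Return (UTF-8 bytes, UTF-16 big-endian bytes) for one character, as hex strings."""
    return hex_bytes(ch.encode("utf-8")), hex_bytes(ch.encode("utf-16-be"))


if __name__ == "__main__":
    for ch in "ABC\u039B\u039C":
        utf8, utf16 = encodings(ch)
        print(f"U+{ord(ch):04X}  {utf8:<6} {utf16}")
```

Code points above U+FFFF take four bytes in both forms. For example, `encodings("\U0001F600")` gives `F0 9F 98 80` in UTF-8 and the surrogate pair `D8 3D DE 00` in UTF-16.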
Specification
The Unicode Consortium also defines standards for collation (sorting), case mapping (capitalization), character normalization and other character operations, some of which are locale-sensitive.
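Two of these operations, normalization and case handling, are available in the Python standard library. The sketch below is a simplified illustration rather than a full implementation of Unicode caseless matching: `canonical_key` is a hypothetical helper, and locale-sensitive sorting is not covered here.

```python
import unicodedata


def canonical_key(s: str) -> str:
    """Build a comparison key that ignores case and composed/decomposed differences.

    Simplified sketch: normalize to NFC, case-fold, then normalize again,
    because case folding can produce text that is no longer normalized.
    """
    folded = unicodedata.normalize("NFC", s).casefold()
    return unicodedata.normalize("NFC", folded)


if __name__ == "__main__":
    composed = "\u00E9"      # e with acute accent as a single code point
    decomposed = "e\u0301"   # plain e followed by a combining acute accent

    # The two strings look the same but are different code point sequences.
    print(composed == decomposed)
    print(canonical_key(composed) == canonical_key(decomposed))

    # Case folding is more than lower-casing: sharp s folds to "ss".
    print(canonical_key("Stra\u00DFe") == canonical_key("STRASSE"))
```

Comparing keys rather than raw strings is one way to treat visually identical text as equal; which normalization form is appropriate depends on the application.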
Identifying Characters
For more general information, see the Unicode article on Wikipedia.