Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.
Character encoding is the act or result of representing characters (human-readable text or symbols such as a or 汉) as a series of bytes (computer-readable zeroes and ones).
Briefly, just as changing the font from Arial to Wingdings changes what your text looks like, changing the encoding changes how a sequence of bytes is interpreted. For example, depending on the encoding, the bytes [1] 0xE2 0x89 0xA0 could represent the text â‰ (followed by a no-break space) in Windows code page 1252, or Б┴═ in KOI8-R, or the character ≠ in UTF-8.
A useful reference is Joel Spolsky's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.
Which Character Encoding is This?
Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.
Bad: "I look at the text and I see óòÒöô, what is this?"
Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. A hex dump of the beginning of the file shows the bytes below."
000000 9e 9f 9a a0 af b4 be f0 9e af b3 f2 20 b7 5f 20
Bad: Anything which tries to use the term "ANSI" in this context [2].
Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now we have to guess your current locale, too.
Better: Specify the precise code page.
On Western Windows installations, this is commonly CP-1252; but if you are only guessing, you need to say so, too.
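One way to replace a guess with a precise name is to ask the system directly. The following is a minimal sketch using Python's standard `locale` and `codecs` modules; the function name `describe_default_encoding` is made up for illustration, and the value it reports depends entirely on the machine and locale it runs on.

```python
import codecs
import locale


def describe_default_encoding() -> str:
    """Report the encoding the current locale uses by default."""
    # False: do not change the process locale, only query the preference.
    name = locale.getpreferredencoding(False)
    # Map the locale's name to the canonical name of Python's codec.
    canonical = codecs.lookup(name).name
    return f"locale default: {name} (Python codec: {canonical})"


print(describe_default_encoding())
```

Quoting the reported name in a question is more useful than writing "ANSI", because it does not depend on the reader guessing your locale.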
Notice:
- We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.
- A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).
- If you know what the text is supposed to represent (even vaguely), this can help narrow down the problem significantly.
- A hex dump is the only unambiguous representation, but please don't overdo it; a few lines of sample data should usually suffice.
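If no hex dump tool is at hand, a few lines of Python can produce one. This is only a sketch: the function names `hex_dump` and `dump_file_head` are made up for illustration, the offset column is printed in hexadecimal (an assumption, since the sample dump above only shows offset zero), and the sample bytes are the ones from the example question.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as lines of a six-digit hex offset plus hex byte values."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        lines.append(f"{offset:06x} " + " ".join(f"{b:02x}" for b in chunk))
    return "\n".join(lines)


def dump_file_head(path: str, count: int = 64) -> str:
    """Hex-dump the first `count` bytes of a file, read in binary mode."""
    with open(path, "rb") as f:
        return hex_dump(f.read(count))


# The sample bytes from the example question above.
sample = bytes.fromhex("9e9f9aa0afb4bef09eafb3f220b75f20")
print(hex_dump(sample))
```

Opening the file in binary mode (`"rb"`) matters: it ensures the bytes are reported exactly as stored, before any decoding takes place.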
Common Questions
- What is character encoding and why should I bother with it?
- What's ï»¿ sign at the beginning of my source file?
- How can I detect the encoding/codepage of a text file?
- Unicode, UTF, ASCII, ANSI format differences
- How can I find the character code of a special character in my text editor?
[1] When talking about encoding, hex representations are often used since they are more concise: 0xE2 is the hex representation of the byte 11100010.
[2] The American National Standards Institute has standardized some character sets (notably ASCII; ANSI standard ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.