Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text and symbols, such as the letter a) as a series of bytes (computer-readable zeroes and ones).

Briefly, just as changing the font from Arial to Wingdings changes what your text looks like, changing the encoding changes how a sequence of bytes is interpreted. For example, depending on the encoding, the bytes [1] 0xE2 0x89 0xA0 could represent the text â‰  (ending in a no-break space) in Windows code page 1252, or Б┴═ in KOI8-R, or the single character ≠ in UTF-8.
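The three interpretations of those bytes can be checked with a short Python 3 sketch. The codec names `cp1252`, `koi8_r` and `utf-8` are Python's own aliases for the encodings mentioned; the variable names are purely illustrative.

```python
# The same three bytes, decoded under three different encodings.
raw = bytes([0xE2, 0x89, 0xA0])

interpretations = {
    name: raw.decode(name) for name in ("cp1252", "koi8_r", "utf-8")
}

for name, text in interpretations.items():
    # repr() makes invisible characters such as the no-break space visible.
    print(f"{name:>7}: {text!r}")
```

The bytes never change; only the table used to map them to characters does, which is why the name of the encoding has to travel with the data.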

A useful reference is Joel Spolsky's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this?"

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. A hex dump of the beginning of the file shows:

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20

Bad: Anything which tries to use the term "ANSI" in this context [2]

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

Commonly on Western Windows installations, you will be using CP-1252; but of course, if you have to guess, you need to say so, too.
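As a rough sketch of how to find out what "ANSI" means on a particular machine, the Python 3 snippet below asks the locale for its default text encoding and, on Windows only, calls the Win32 function `GetACP` through `ctypes`. The result depends entirely on the system it runs on (and on settings such as Python's UTF-8 mode), so treat it as a way to stop guessing rather than as a fixed answer.

```python
import locale
import sys

# Encoding Python would pick for text I/O under the current locale; on Windows
# this is typically the "ANSI" code page, e.g. "cp1252" on Western installations.
default_encoding = locale.getpreferredencoding(False)
print("Locale default encoding:", default_encoding)

if sys.platform == "win32":
    import ctypes

    # GetACP is the Win32 call that returns the active "ANSI" code page number.
    print("Active code page:", ctypes.windll.kernel32.GetACP())
```

Quoting this value in a question is more useful than writing "ANSI".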

Notice:

  • We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.

  • A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).

  • If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.

  • A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice.
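If no hex dump tool is at hand, a few lines of Python 3 can produce one. This is a minimal sketch: the helper name `hex_dump` and its parameters are made up for illustration, and the sample bytes are simply the ones from the example dump above.

```python
def hex_dump(data: bytes, width: int = 16, max_lines: int = 4) -> str:
    """Format the first few lines of data as an offset plus hex bytes."""
    lines = []
    for offset in range(0, min(len(data), width * max_lines), width):
        chunk = data[offset:offset + width]
        lines.append(f"{offset:06x} " + " ".join(f"{b:02x}" for b in chunk))
    return "\n".join(lines)


# Hypothetical sample; in practice read the bytes with open(path, "rb").read(64).
sample = bytes.fromhex("9e9f9aa0afb4bef09eafb3f220b75f20")
print(hex_dump(sample))
```

Opening the file in binary mode (`"rb"`) matters: it guarantees that no decoding happens before the bytes are printed.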

Common Questions


[1] When talking about encoding, hex representations are often used because they are more concise -- 0xE2 is the hex representation of the byte 11100010.

[2] The American National Standards Institute has standardized some character sets (notably ASCII; ANSI standard ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.

See Also

14318 questions

5 votes, 1 answer

Choose default type and encoding for C++ string literals at compile time

C++11 introduced new string literals for UTF-8, UTF-16 and UTF-32 with the u8, u and U prefixes, but I have to hard-code which one I want to use. I'm looking for a way to select which encoding I want to use at compile time (similar to how a typedef…
DrYap

5 votes, 2 answers

Spring form and controller UTF-8 bad encoding

I have a problem with my UTF-8 encoding. My webapp uses French words that are correctly displayed in my JSP, but not in my controller after a POST. For example, in my JSP I have: Prénom de mon père and when I post the form, the controller…
thibon

5 votes, 3 answers

Does std::ctype<char> always classify characters by the "C" locale?

cppreference says std::ctype<char> provides character classification based on the classic "C" locale. Is this even true when we create a locale like this: std::locale loc(std::locale("en_US.UTF8"), new std::ctype<char>); Will the facet of loc still…
template boy

5 votes, 2 answers

Print a list that contains Chinese characters in Python

My code looks like : # -*- coding: utf-8 -*- print ["asdf", "中文"] print ["中文"] print "中文" The output in the Eclipse console is very strange: ['asdf', '\xe4\xb8\xad\xe6\x96\x87'] ['\xe4\xb8\xad\xe6\x96\x87'] 中文 My first question is: why did the…
user958547

5 votes, 3 answers

Perl's default string encoding and representation

In the following: my $string = "Can you \x{FB01}nd my r\x{E9}sum\x{E9}?\n"; The \x{FB01} and \x{E9} are code points. And code points are encoded via an encoding scheme to a series of octets. So the character è which has the codepoint \x{FB01} is…
Cratylus

5 votes, 2 answers

How can I check if a String is encodable in some encoding?

The following test fails on converted Latin1, because illegal characters are replaced with byte with the value 63 (question mark). The problem is that these characters should better cause some exception ... @Test public void testEncoding()…
dmatej

5 votes, 4 answers

Charset for Spanish Windows

What is the charset for Spanish Windows?
Valentina

5 votes, 1 answer

Is it possible to "sniff" the Character encoding?

I have a webpage that accepts CSV files. These files may be created in a variety of places. (I think) there is no way to specify the encoding in a CSV file - so I can not reliably treat all of them as utf-8 or any other encoding. Is there a way to…
shabda

5 votes, 3 answers

COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary'?

mysql> SELECT LOCATE("n", "München") COLLATE utf8_general_ci; ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary' How do I get rid of this error? What I already tried (copy&paste): $ mysql -u admin -p…
feklee

5 votes, 3 answers

Should this char be unsigned?

I found some confusing code during code review and am a bit puzzled. Doing some research I found this situation. I wrote this sample of code to highlight the problem char d = '©';// this is -87,the copyright symbol , (actually its 169…
Andrew Keith

5 votes, 2 answers

Prob. on Hebrew encoding

I have a Hebrew text like "×گض¸×¨ض´×™×،ض°×کוض¹×ں", and I want to convert it to readable Unicode Hebrew characters. I tried this code: const string Str = "×گض¸×¨ض´×™×،ض°×کוض¹×ں"; Encoding enc1 = Encoding.Default; Encoding enc2 =…
JustMe

5 votes, 2 answers

How to set charset to .js file in MVC ScriptBundle?

I have a script.js file that contains several strings in Cyrillic. When I attempt to load this with a standard link like this: Cyrillic letters become…
Barada

5 votes, 2 answers

UTF-8 French accented characters issue

When I view the data stored in a MySQL database using phpMyAdmin, the characters are stored exactly as é à ç. However, when I use PHP to display this data in an HTML document that has exactly the following structure:
Mbarry

5 votes, 1 answer

Character encoding in IDEA output of AssertionError

I am using IntelliJ IDEA 12.0.4 and have some tests. When I run one using the JUnit4 framework, my AssertionError looks like: java.lang.AssertionError: Status should be: Черновик expected [true] but found [false] If I use TestNG, it looks like…
QAutomatron

5 votes, 2 answers

Cannot set SQLiteDatabase encoding to anything other than UTF-8

I'm working on a problem where I need to attach one sqlite database to another. One of the databases is created by my app and the other is downloaded from a remote server. I'd prefer to get it in a legitimate data format (like JSON), but I can't…
Krylez