Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text/symbols such as a or 汉) as a series of bytes (computer-readable zeroes and ones).

Briefly, just as changing the font from Arial to Wingdings changes what your text looks like, changing the encoding changes how a sequence of bytes is interpreted. For example, depending on the encoding, the bytes¹ 0xE2 0x89 0xA0 could represent the text â‰ (followed by a non-breaking space) in Windows code page 1252, or Б┴═ in KOI8-R, or the character ≠ in UTF-8.
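As a rough illustration of the example above, the following Python sketch decodes the same three bytes with three different codecs and prints the resulting code points. The names RAW and code_points are made up for this sketch; the codec names are the ones Python's standard library uses for Windows code page 1252, KOI8-R, and UTF-8.

```python
# The three bytes from the example above.
RAW = bytes([0xE2, 0x89, 0xA0])


def code_points(data: bytes, encoding: str) -> list:
    """Decode data with the given codec and list the resulting code points."""
    return [f"U+{ord(ch):04X}" for ch in data.decode(encoding)]


# Same bytes, three interpretations. Code points are printed instead of the
# characters themselves so that the invisible non-breaking space shows up.
for codec in ("cp1252", "koi8_r", "utf-8"):
    print(f"{codec:>6}: {' '.join(code_points(RAW, codec))}")
```

The bytes never change; only the codec used to interpret them does, which is why a question about "strange characters" needs to say which encoding was used to view them.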

A useful reference is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this?"

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. A hex dump of the beginning of the file shows:

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20

Bad: Anything which tries to use the term "ANSI" in this context²

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

Commonly on Western Windows installations, you will be using CP-1252; but of course, if you have to guess, you need to say so, too.
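If you are unsure which code page your system defaults to, one way to find out is to ask a tool that reports it. The sketch below uses Python's locale and codecs modules; default_encoding_name is a hypothetical helper name, and the result reflects what Python sees for the current locale, which may differ from what another program uses (for example, when Python's UTF-8 mode is enabled).

```python
import codecs
import locale


def default_encoding_name() -> str:
    """Return Python's canonical codec name for the locale's preferred encoding."""
    # False: do not change the process locale as a side effect of asking.
    preferred = locale.getpreferredencoding(False)
    # Normalise aliases to the name Python's codec registry uses.
    return codecs.lookup(preferred).name


print(default_encoding_name())
```

Quoting the name this prints is more useful in a question than "ANSI", because it identifies one specific encoding.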

Notice:

We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.
A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).
If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.
A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice.
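A short hex dump like the one shown earlier can be produced with many tools. As one possible sketch, the Python function below formats the first few lines of some bytes in a similar layout; hex_dump, dump_file_head, and their parameters are names invented for this example, and bytes.hex with a separator assumes Python 3.8 or later.

```python
def hex_dump(data: bytes, width: int = 16, max_lines: int = 4) -> str:
    """Format up to max_lines rows of data as offset plus hex bytes."""
    lines = []
    for offset in range(0, min(len(data), width * max_lines), width):
        chunk = data[offset:offset + width]
        # Split each row into two halves, separated by a double space.
        halves = (chunk[:width // 2], chunk[width // 2:])
        hex_part = "  ".join(half.hex(" ") for half in halves if half)
        lines.append(f"{offset:06x} {hex_part}")
    return "\n".join(lines)


def dump_file_head(path: str, count: int = 64) -> str:
    """Read the first count bytes of a file in binary mode and dump them."""
    with open(path, "rb") as handle:  # "rb": no decoding is applied
        return hex_dump(handle.read(count))
```

Opening the file in binary mode matters here: the point of the dump is to show the raw bytes before any encoding has been applied to them.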

Common Questions

¹ When talking about encoding, hex representations are often used since they are more concise -- 0xE2 is the hex representation of the byte 11100010.

² The American National Standards Institute has standardized some character sets (notably ASCII; ANSI standard ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.

What's the difference between encoding and charset?

I am confused about text encoding and charset. For many reasons, I have to learn non-Unicode, non-UTF-8 stuff in my upcoming work. I find the word "charset" in email headers, as in "ISO-2022-JP", but there's no such encoding in text editors. (I…

encoding character-encoding

asked Feb 17 '10 at 14:55

TK.

158 votes, 7 answers

How can I transform string to UTF-8 in C#?

I have a string that I receive from a third party app and I would like to display it correctly in any language using C# on my Windows Surface. Due to incorrect encoding, a piece of my string looks like this in Spanish: AcciÃ³n whereas it should…

c# string encoding utf-8 character-encoding

asked Dec 27 '12 at 15:56

Gaara

156 votes, 12 answers

PHP: Convert any string to UTF-8 without knowing the original character set, or at least try

I have an application that deals with clients from all over the world, and, naturally, I want everything going into my databases to be UTF-8 encoded. The main problem for me is that I don't know what encoding the source of any string is going to be…

php utf-8 character-encoding

asked Nov 02 '11 at 11:27

Grim...

156 votes, 3 answers

Change the encoding of a file in Visual Studio Code

Is there any way to change the encoding of a file? For example UTF-8 to ISO 8859-1? Setting Example Sublime Text: "default_encoding": "UTF-8"

visual-studio-code character-encoding vscode-settings

asked May 06 '15 at 16:43

Fernando Tholl

155 votes, 23 answers

How do I remove ï»¿ from the beginning of a file?

I have a CSS file that looks fine when I open it using gedit, but when it's read by PHP (to merge all the CSS files into one), this CSS has the following characters prepended to it: ï»¿ PHP removes all whitespace, so a random ï»¿ in the middle of…

php utf-8 character-encoding byte-order-mark mojibake

asked Jul 15 '10 at 13:35

Matt

148 votes, 9 answers

Can I make git recognize a UTF-16 file as text?

I'm tracking a Virtual PC virtual machine file (*.vmc) in git, and after making a change git identified the file as binary and wouldn't diff it for me. I discovered that the file was encoded in UTF-16. Can git be taught to recognize that this file…

git unicode character-encoding diff utf-16

asked Apr 22 '09 at 15:51

skiphoppy

146 votes, 16 answers

Java : How to determine the correct charset encoding of a stream

With reference to the following thread: Java App : Unable to read iso-8859-1 encoded file correctly What is the best way to programmatically determine the correct charset encoding of an inputstream/file? I have tried using the following: File in = …

java file encoding stream character-encoding

asked Jan 31 '09 at 15:34

Joel

142 votes, 13 answers

How to change the default encoding to UTF-8 for Apache?

I am using a hosting company, and it will list the files in a directory if the file index.html is not there; it uses ISO-8859-1 as the default encoding. If the server is Apache, is there a way to set UTF-8 as the default instead? Update: Additionally…

apache character-encoding apache-config

asked May 27 '09 at 04:04

nonopolarity

136 votes, 8 answers

How to support UTF-8 encoding in Eclipse

How can I add UTF-8 support in Eclipse? I want to add, for example, the Russian language, but Eclipse won't support it. What should I do? Please guide me.

eclipse encoding utf-8 character-encoding

asked Feb 07 '12 at 17:35

Katty

133 votes, 10 answers

How can I find non-ASCII characters in MySQL?

I'm working with a MySQL database that has some data imported from Excel. The data contains non-ASCII characters (em dashes, etc.) as well as hidden carriage returns or line feeds. Is there a way to find these records using MySQL?

mysql character-encoding

asked Dec 30 '08 at 22:54

Ed Mays

132 votes, 16 answers

Who sets response content-type in Spring MVC (@ResponseBody)

I have an annotation-driven Spring MVC Java web application running on the Jetty web server (currently in the Maven Jetty plugin). I'm trying to do some AJAX support with one controller method returning just a String help text. Resources are in UTF-8…

java web-applications spring-mvc character-encoding

asked Sep 01 '10 at 08:49

Hurda

130 votes, 2 answers

Changing PowerShell's default output encoding to UTF-8

By default, when you redirect the output of a command to a file or pipe it into something else in PowerShell, the encoding is UTF-16, which isn't useful. I'm looking to change it to UTF-8. It can be done on a case-by-case basis by replacing the…

powershell utf-8 character-encoding

asked Oct 18 '16 at 02:54

rwallace

129 votes, 13 answers

How to check if a String contains only ASCII?

The call Character.isLetter(c) returns true if the character is a letter. But is there a way to quickly find if a String only contains the base characters of ASCII?

java string character-encoding ascii

asked Aug 27 '10 at 14:19

TambourineMan

114 votes, 6 answers

Is ASCII code 7-bit or 8-bit?

My teacher told me ASCII is an 8-bit character coding scheme. But it is defined only for codes 0-127, which means it can fit into 7 bits. So can't it be argued that ASCII is actually a 7-bit code? And what do we mean at all when saying ASCII…

character-encoding ascii

asked Feb 04 '13 at 15:42

Anurag Kalia

113 votes, 3 answers

How does UTF-8 "variable-width encoding" work?

The Unicode standard has enough code points in it that you need 4 bytes to store them all. That's what the UTF-32 encoding does. Yet the UTF-8 encoding somehow squeezes these into much smaller spaces by using something called "variable-width…

unicode utf-8 character-encoding multibyte

asked Oct 09 '09 at 13:02

dsimard
