Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text and symbols) as a series of bytes (computer-readable zeroes and ones).

Briefly, just like changing the font from Arial to Wingdings changes what your text looks like, changing encodings affects the interpretation of a sequence of bytes. For example, depending on the encoding, the bytes [1] 0xE2 0x89 0xA0 could represent the text â‰ plus a no-break space in Windows code page 1252, or Б┴═ in KOI8-R, or the single character ≠ in UTF-8.
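The effect is easy to reproduce. The following is a minimal sketch, assuming Python 3.9 or later and its built-in codecs `cp1252`, `koi8_r` and `utf-8`; the helper name `decode_with` is made up for this illustration.

```python
# The same three bytes, interpreted under three different encodings.
raw = bytes([0xE2, 0x89, 0xA0])


def decode_with(data: bytes, encodings: list[str]) -> dict[str, str]:
    """Decode `data` under each named codec; undecodable bytes become U+FFFD."""
    return {name: data.decode(name, errors="replace") for name in encodings}


results = decode_with(raw, ["cp1252", "koi8_r", "utf-8"])
for name, text in results.items():
    # ascii() makes invisible characters such as the no-break space visible.
    print(f"{name:8} {text!r:12} {ascii(text)}")
```

The bytes never change; only the table used to look them up does. Printing the `ascii()` form alongside the text helps when a result contains characters that your terminal cannot display.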

A useful reference is Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this?"

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. A hex dump of the beginning of the file shows:

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20

Bad: Anything which tries to use the term "ANSI" in this context [2]

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

Commonly on Western Windows installations, you will be using CP-1252; but of course, if you have to guess, you need to say so, too.
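If you are unsure which code page your system defaults to, you can ask it rather than guess. This is a small sketch using Python's standard `locale` and `sys` modules; the function name `describe_default_encodings` is hypothetical, and the values it reports depend entirely on the machine and session it runs on.

```python
import locale
import sys


def describe_default_encodings() -> dict[str, str]:
    """Report the encodings this Python process would use by default."""
    return {
        # What legacy Windows documentation calls "ANSI" usually shows up here,
        # for example "cp1252" on many Western Windows installations.
        "locale preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        # stdout may be replaced by an object without an encoding attribute.
        "stdout": getattr(sys.stdout, "encoding", None) or "unknown",
    }


for label, value in describe_default_encodings().items():
    print(f"{label}: {value}")
```

Including this output in a question tells readers the precise code page in use, so nobody has to infer it from your locale.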

Notice:

  • We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.

  • A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).

  • If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.

  • A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice.
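If you have no hex dump tool at hand, a few lines of Python will produce one. This is only a sketch: the function names `hexdump` and `dump_file_head` are invented for this example, and the layout simply imitates the sample dump shown earlier (a six-digit hexadecimal offset, then two groups of eight bytes).

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as rows of hex pairs, each row prefixed by its offset."""
    half = width // 2
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        left = " ".join(f"{b:02x}" for b in chunk[:half])
        right = " ".join(f"{b:02x}" for b in chunk[half:])
        # rstrip() drops the trailing gap when the last row is short.
        lines.append(f"{offset:06x} {left}  {right}".rstrip())
    return "\n".join(lines)


def dump_file_head(path: str, count: int = 64) -> str:
    """Hex-dump the first `count` bytes of a file, read in binary mode."""
    with open(path, "rb") as handle:
        return hexdump(handle.read(count))


# The sixteen bytes from the sample dump above, used as test data.
sample = bytes.fromhex("9e9f9aa0afb4bef09eafb3f220b75f20")
print(hexdump(sample))
```

Opening the file in binary mode matters here: it is what keeps Python from applying any decoding of its own before you see the bytes.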

[1] When talking about encoding, hex representations are often used since they are more concise -- 0xE2 is the hex representation of the byte 11100010.

[2] The American National Standards Institute has standardized some character sets (notably ASCII; ANSI standard ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.

14318 questions
160 votes, 11 answers

What's the difference between encoding and charset?

I am confused about the text encoding and charset. For many reasons, I have to learn non-Unicode, non-UTF8 stuff in my upcoming work. I find the word "charset" in email headers as in "ISO-2022-JP", but there's no such encoding in text editors. (I…
TK.

158 votes, 7 answers

How can I transform string to UTF-8 in C#?

I have a string that I receive from a third party app and I would like to display it correctly in any language using C# on my Windows Surface. Due to incorrect encoding, a piece of my string looks like this in Spanish: AcciÃ³n whereas it should…
Gaara

156 votes, 12 answers

PHP: Convert any string to UTF-8 without knowing the original character set, or at least try

I have an application that deals with clients from all over the world, and, naturally, I want everything going into my databases to be UTF-8 encoded. The main problem for me is that I don't know what encoding the source of any string is going to be…
Grim...

156 votes, 3 answers

Change the encoding of a file in Visual Studio Code

Is there any way to change the encoding of a file? For example UTF-8 to ISO 8859-1? Setting Example Sublime Text: "default_encoding": "UTF-8"
Fernando Tholl

155 votes, 23 answers

How do I remove ï»¿ from the beginning of a file?

I have a CSS file that looks fine when I open it using gedit, but when it's read by PHP (to merge all the CSS files into one), this CSS has the following characters prepended to it: ï»¿ PHP removes all whitespace, so a random ï»¿ in the middle of…
Matt

148 votes, 9 answers

Can I make git recognize a UTF-16 file as text?

I'm tracking a Virtual PC virtual machine file (*.vmc) in git, and after making a change git identified the file as binary and wouldn't diff it for me. I discovered that the file was encoded in UTF-16. Can git be taught to recognize that this file…
skiphoppy

146 votes, 16 answers

Java : How to determine the correct charset encoding of a stream

With reference to the following thread: Java App : Unable to read iso-8859-1 encoded file correctly What is the best way to programmatically determine the correct charset encoding of an inputstream/file ? I have tried using the following: File in = …
Joel

142 votes, 13 answers

How to change the default encoding to UTF-8 for Apache?

I am using a hosting company and it will list the files in a directory if the file index.html is not there, it uses iso-8859-1 as the default encoding. If the server is Apache, is there a way to set UTF-8 as the default instead? Update: Additionally…
nonopolarity

136 votes, 8 answers

How to support UTF-8 encoding in Eclipse

How can I add UTF-8 support in eclipse? I want to add for example Russian language but eclipse won't support it. What should I do? Please guide me.
Katty

133 votes, 10 answers

How can I find non-ASCII characters in MySQL?

I'm working with a MySQL database that has some data imported from Excel. The data contains non-ASCII characters (em dashes, etc.) as well as hidden carriage returns or line feeds. Is there a way to find these records using MySQL?
Ed Mays

132 votes, 16 answers

Who sets response content-type in Spring MVC (@ResponseBody)

I have an annotation-driven Spring MVC Java web application running on the Jetty web server (currently in the Maven Jetty plugin). I'm trying to do some AJAX support with one controller method returning just String help text. Resources are in UTF-8…
Hurda

130 votes, 2 answers

Changing PowerShell's default output encoding to UTF-8

By default, when you redirect the output of a command to a file or pipe it into something else in PowerShell, the encoding is UTF-16, which isn't useful. I'm looking to change it to UTF-8. It can be done on a case-by-case basis by replacing the…
rwallace

129 votes, 13 answers

How to check if a String contains only ASCII?

The call Character.isLetter(c) returns true if the character is a letter. But is there a way to quickly find if a String only contains the base characters of ASCII?
TambourineMan

114 votes, 6 answers

Is ASCII code 7-bit or 8-bit?

My teacher told me ASCII is 8-bit character coding scheme. But it is defined only for 0-127 codes which means it can be fit into 7-bits. So can't it be argued that ASCII bit is actually 7-bit code? And what do we mean to say at all when saying ASCII…
Anurag Kalia

113 votes, 3 answers

How does UTF-8 "variable-width encoding" work?

The unicode standard has enough code-points in it that you need 4 bytes to store them all. That's what the UTF-32 encoding does. Yet the UTF-8 encoding somehow squeezes these into much smaller spaces by using something called "variable-width…
dsimard