Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text and symbols, such as the letter a) as a series of bytes (computer-readable zeroes and ones).

Briefly, just like changing the font from Arial to Wingdings changes what your text looks like, changing encodings affects the interpretation of a sequence of bytes. For example, depending on the encoding, the bytes¹ 0xE2 0x89 0xA0 could represent the text â‰ (followed by a no-break space) in Windows code page 1252, or Б┴═ in KOI8-R, or the single character ≠ in UTF-8.
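The example above can be reproduced with a short sketch. Python is an assumption here (the tag itself is language-neutral), and the codec names `cp1252`, `koi8_r` and `utf-8` are Python's names for the three encodings mentioned.

```python
# The same three bytes, interpreted under three different encodings.
raw = bytes([0xE2, 0x89, 0xA0])

# Map each encoding name to the text the bytes decode to under it.
decoded = {encoding: raw.decode(encoding) for encoding in ("cp1252", "koi8_r", "utf-8")}

for encoding, text in decoded.items():
    # repr() makes invisible characters such as the no-break space visible.
    print(f"{encoding:>7}: {text!r} ({len(text)} character(s))")
```

The bytes never change; only the decoding table does. Code page 1252 and KOI8-R each yield three characters, while UTF-8 combines all three bytes into one.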

A useful reference is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this?"

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. A hex dump of the beginning of the file shows:

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20
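If no hex dump tool is at hand, a dump like the one above can be produced with a few lines of code. This is a minimal sketch in Python (an assumption; any language works). `hex_dump` is a hypothetical helper written for this illustration, not a standard library function.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as lines of 'offset hex hex hex ...' (hypothetical helper)."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_bytes = " ".join(f"{byte:02x}" for byte in chunk)
        lines.append(f"{offset:06x} {hex_bytes}")
    return "\n".join(lines)


# The sample bytes from the hex dump in the example question above.
sample = bytes.fromhex("9e9f9aa0afb4bef09eafb3f220b75f20")
print(hex_dump(sample))
```

For a real file, open it in binary mode and dump only the first few dozen bytes, for example `hex_dump(open(path, "rb").read(64))`, where `path` is whatever file you are investigating.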

Bad: Anything which tries to use the term "ANSI" in this context.²

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

Commonly on Western Windows installations, you will be using CP-1252; but of course, if you have to guess, you need to say so, too.
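Rather than guessing, you can ask the system which default it uses. The sketch below assumes Python; `locale.getpreferredencoding` and `codecs.lookup` are standard library calls. The result depends on the machine and locale, and Python reports UTF-8 here whenever its UTF-8 mode is enabled.

```python
import codecs
import locale

# Ask which encoding the current locale uses by default (False: do not
# modify the process locale while looking it up).
preferred = locale.getpreferredencoding(False)

# Normalize the reported name to Python's canonical codec name.
canonical = codecs.lookup(preferred).name

print(f"Locale default encoding: {preferred} (Python codec: {canonical})")
```

Quoting this value in a question, for example "my system default is cp1252", is far more useful than saying "ANSI".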

Notice:

  • We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.

  • A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).

  • If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.

  • A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice.
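Knowing what the text should say can narrow down the encoding mechanically. The sketch below assumes Python; `candidate_encodings` is a hypothetical helper, and the candidate list is an arbitrary illustration, not a complete set.

```python
from typing import Iterable, List


def candidate_encodings(raw: bytes, expected: str, encodings: Iterable[str]) -> List[str]:
    """Return the encodings under which `raw` decodes to text containing `expected`."""
    matches = []
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        if expected in text:
            matches.append(encoding)
    return matches


# Bytes taken from a hex dump; we know the text should contain "é".
sample = b"caf\xe9"
matches = candidate_encodings(sample, "é", ["utf-8", "cp1252", "koi8_r", "cp437"])
print(matches)
```

Here only `cp1252` survives. The bytes are invalid as UTF-8, and the other two code pages map 0xE9 to a different character. Several closely related encodings can match the same sample, so a match is a hint, not proof.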

Common Questions


¹ When talking about encoding, hex representations are often used since they are more concise -- 0xE2 is the hex representation of the byte 11100010.

² The American National Standards Institute has standardized some character sets (notably ASCII, as ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.

See Also

14318 questions
316
votes
25 answers

Detect encoding and make everything UTF-8

I'm reading out lots of texts from various RSS feeds and inserting them into my database. Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO 8859-1. Unfortunately, there are sometimes problems with the…
caw
  • 29,212
  • 58
  • 168
  • 279
287
votes
17 answers

Is there an upside down caret character?

I have to maintain a large number of classic ASP pages, many of which have tabular data with no sort capabilities at all. Whatever order the original developer used in the database query is what you're stuck with. I want to tack on some basic…
Joel Coehoorn
  • 362,140
  • 107
  • 528
  • 764
267
votes
13 answers

How to convert Strings to and from UTF8 byte arrays in Java

In Java, I have a String and I want to encode it as a byte array (in UTF8, or some other encoding). Alternately, I have a byte array (in some known encoding) and I want to convert it into a Java String. How do I do these conversions?
mcherm
  • 20,782
  • 10
  • 41
  • 50
262
votes
11 answers

"for line in..." results in UnicodeDecodeError: 'utf-8' codec can't decode byte

Here is my code, for line in open('u.item'): # Read each line Whenever I run this code it gives the following error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte I tried to solve this and…
SujitS
  • 8,381
  • 3
  • 14
  • 38
244
votes
10 answers

What is a vertical tab?

What was the original historical use of the vertical tab character (\v in the C language, ASCII 11)? Did it ever have a key on a keyboard? How did someone generate it? Is there any language or system still in use today where the vertical tab…
dmazzoni
  • 12,008
  • 4
  • 35
  • 34
242
votes
18 answers

How do you echo a 4-digit Unicode character in Bash?

I'd like to add the Unicode skull and crossbones to my shell prompt (specifically the 'SKULL AND CROSSBONES' (U+2620)), but I can't figure out the magic incantation to make echo spit it, or any other, 4-digit Unicode character. Two-digit ones are…
masukomi
  • 8,507
  • 8
  • 35
  • 45
236
votes
8 answers

Writing Unicode text to a text file?

I'm pulling data out of a Google doc, processing it, and writing it to a file (that eventually I will paste into a Wordpress page). It has some non-ASCII symbols. How can I convert these safely to symbols that can be used in HTML source? Currently…
simon
  • 5,629
  • 13
  • 29
  • 27
234
votes
10 answers

What is ANSI format?

What is ANSI encoding format? Is it a system default format? In what way does it differ from ASCII?
web dunia
  • 8,634
  • 17
  • 47
  • 63
219
votes
4 answers

Write to UTF-8 file in Python

I'm really confused with the codecs.open function. When I do: file = codecs.open("temp", "w", "utf-8") file.write(codecs.BOM_UTF8) file.close() It gives me the error UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal…
John Jiang
  • 9,721
  • 11
  • 46
  • 60
218
votes
15 answers

Do I really need to encode '&' as '&amp;'?

I'm using an '&' symbol with HTML5 and UTF-8 in my site's <title>. Google shows the ampersand fine on its SERPs, as do all the browsers in their titles. http://validator.w3.org is giving me this: & did not start a character reference. (& probably…
asked Aug 16 '10 at 13:09
Haroldo
  • 33,209
  • 46
  • 123
  • 164
214
votes
13 answers

PHP DOMDocument loadHTML not encoding UTF-8 correctly

I'm trying to parse some HTML using DOMDocument, but when I do, I suddenly lose my encoding (at least that is how it appears to me). $profile = "<div><p>various japanese characters</p></div>"; $dom = new DOMDocument(); $dom->loadHTML($profile);…
asked Nov 21 '11 at 20:37
Slightly A.
  • 2,319
  • 2
  • 14
  • 10
213
votes
6 answers

Why charset names are not constants?

Charset issues are confusing and complicated by themselves, but on top of that you have to remember exact names of your charsets. Is it "utf8"? Or "utf-8"? Or maybe "UTF-8"? When searching internet for code samples you will see all of the above. Why…
asked Nov 05 '09 at 22:18
serg
  • 103,023
  • 70
  • 299
  • 324
183
votes
11 answers

Convert Unicode to ASCII without errors in Python

My code just scrapes a web page, then converts it to Unicode. html = urllib.urlopen(link).read() html.encode("utf8","ignore") self.response.out.write(html) But I get a UnicodeDecodeError: Traceback (most recent call last): File…
asked Mar 02 '10 at 17:52
themirror
  • 8,937
  • 6
  • 38
  • 71
182
votes
6 answers

What is the difference between encode/decode?

I've never been sure that I understand the difference between str/unicode decode and encode. I know that str().decode() is for when you have a string of bytes that you know has a certain character encoding, given that encoding name it will return a…
asked Jan 15 '09 at 15:13
ʞɔıu
  • 43,326
  • 30
  • 94
  • 142
180
votes
4 answers

Why specify @charset "UTF-8"; in your CSS file?

I've been seeing this instruction as the very first line of numerous CSS files that have been turned over to me: @charset "UTF-8"; What does it do, and is this at-rule necessary? Also, if I include this meta tag in my "head" element, would that…
asked Mar 26 '10 at 19:16
rsturim
  • 6,388
  • 14
  • 43
  • 58