Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

unicode defines abstract code points and their interactions. It also defines multiple encodings for the storage and exchange of those code points. All of them can express every valid Unicode code point, but they differ in size, compatibility, efficiency, and how well they can represent invalid data.

utf-8 (people sometimes write just UTF for this encoding) is an ascii superset. Its byte layout can also carry what the other encodings can hold, including sequences that are invalid on their own such as lone surrogates, though strict decoders reject those. If there is no compelling compatibility constraint, this encoding is preferred.
punycode is used only for internationalized domain names (historical contenders were utf-5 and utf-6).
GB18030 is the official Chinese encoding.
UTF-EBCDIC was meant to fill the role of utf-8 on EBCDIC systems but never caught on.
utf-7 was designed for systems that are not 8-bit clean, such as old email, but never gained much popularity even there.
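
As a rough illustration of the size and ascii-compatibility points above, here is a minimal Python sketch. The helper name utf8_lengths and the sample strings are made up for this example; only the built-in str.encode is used.

```python
def utf8_lengths(text):
    """Return the number of UTF-8 bytes used by each code point in text."""
    return [len(ch.encode("utf-8")) for ch in text]

# Pure ASCII text produces identical bytes under both encodings,
# which is what "ascii superset" means in practice.
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# U+0041, U+00E9, U+20AC and U+1F600 need 1, 2, 3 and 4 bytes respectively.
lengths = utf8_lengths("A\u00e9\u20ac\U0001F600")

print(same_bytes)
print(lengths)
```

The variable width is the trade-off: ASCII-heavy text stays compact, while code points outside the Basic Multilingual Plane take four bytes each.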

The following encodings each have three variants: big-endian, little-endian, and either byte order signalled by a leading BOM.

utf-16 (often utf-16le) is what early adopters moved to after embracing ucs2 back when 65,536 code points were thought to be enough. Apart from orphaned surrogates, it cannot represent malformed utf-8 or utf-32 sequences. It is also rarely more space-efficient than utf-8, and it is not fixed-width (once combining sequences are considered, not even utf-32 really is).
utf-32 (identical to ucs4, aka modern ucs) is the encoding with one code unit per code point. Because combining code points negate this already questionable benefit, and because of its large storage demand, it is seldom used, even for internal representation.
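
The byte-order variants, surrogate pairs and combining code points mentioned above can be sketched in Python as follows. The helper encoded_sizes and the sample strings are assumptions for illustration; note that Python's plain "utf-16" codec writes a BOM in the machine's native byte order when encoding.

```python
import codecs

def encoded_sizes(text):
    """Return the encoded length in bytes for each BOM-less encoding."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}

# Same code point, two byte orders.
big = "a".encode("utf-16-be")      # bytes 00 61
little = "a".encode("utf-16-le")   # bytes 61 00

# The plain "utf-16" codec prepends a BOM so a reader can detect the order.
with_bom = "a".encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# A code point beyond U+FFFF needs a surrogate pair (two 16-bit code units).
pair = "\U0001F600".encode("utf-16-be")

# "e" followed by COMBINING ACUTE ACCENT: one visible character,
# but two code points, so even utf-32 needs two code units for it.
combined = "e\u0301"

print(big, little, has_bom)
print(pair.hex())
print(len(combined), encoded_sizes(combined))
print(encoded_sizes("a\U0001F600"))
```

Comparing the dictionaries for a few strings of your own is a quick way to see how the size trade-offs play out for a given kind of text.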

Resources

Wikipedia on Unicode

800 questions

544

votes

13 answers

UTF-8, UTF-16, and UTF-32

What are the differences between UTF-8, UTF-16, and UTF-32? I understand that they will all store Unicode, and that each uses a different number of bytes to represent a character. Is there an advantage to choosing one over the other?

unicode utf-8 utf-16 utf utf-32

asked Jan 30 '09 at 17:05

user60456

376

votes

2 answers

Unicode, UTF, ASCII, ANSI format differences

What is the difference between the Unicode, UTF8, UTF7, UTF16, UTF32, ASCII, and ANSI encodings? In what way are these helpful for programmers?

unicode character-encoding ascii ansi utf

asked Mar 31 '09 at 06:02

web dunia

141

votes

5 answers

Difference between UTF-8 and UTF-16?

Difference between UTF-8 and UTF-16? Why do we need these? MessageDigest md = MessageDigest.getInstance("SHA-256"); String text = "This is some text"; md.update(text.getBytes("UTF-8")); // Change this to "UTF-16" if needed byte[] digest =…

java unicode utf-8 utf-16 utf

asked Jan 11 '11 at 07:38

theJava

140

votes

15 answers

Which encoding opens CSV files correctly with Excel on both Mac and Windows?

We have a web app that exports CSV files containing foreign characters with UTF-8, no BOM. Both Windows and Mac users get garbage characters in Excel. I tried converting to UTF-8 with BOM; Excel/Win is fine with it, Excel/Mac shows gibberish. I'm…

windows excel macos csv utf

asked Jul 05 '11 at 19:50

Timm

votes

13 answers

<0xEF,0xBB,0xBF> character showing up in files. How to remove them?

I am doing compressing of JavaScript files and the compressor is complaining that my files have ï»¿ character in them. How can I search for these characters and remove them?

file unicode utf-8 utf

asked Sep 04 '11 at 07:20

Quintin Par

votes

1 answer

Unicode encoding for string literals in C++11

Following a related question, I'd like to ask about the new character and string literal types in C++11. It seems that we now have four sorts of characters and five sorts of string literals. The character types: char a = '\x30'; //…

c++ unicode c++11 utf string-literals

asked Jul 22 '11 at 21:07

Kerrek SB

votes

6 answers

How many characters can be mapped with Unicode?

I am asking for the count of all the possible valid combinations in Unicode with explanation. I know a char can be encoded as 1,2,3 or 4 bytes. I also don't understand why continuation bytes have restrictions even though starting byte of that char…

unicode utf-8 utf

asked May 07 '11 at 21:28

Ufuk Hacıoğulları

votes

9 answers

Android WebView with garbled UTF-8 characters.

I'm using some webviews in my Android app, but am unable to make them display in UTF-8 encoding. If I use this one I won't see my Scandinavian characters: mWebView.loadUrl("file:///android_asset/om.html") And if I try this one, I won't get anything…

android webview utf

asked Feb 08 '11 at 12:36

elwis

votes

5 answers

What's the point of UTF-16?

I've never understood the point of UTF-16 encoding. If you need to be able to treat strings as random access (i.e. a code point is the same as a code unit) then you need UTF-32, since UTF-16 is still variable length. If you don't need this, then…

utf-8 character-encoding utf-16 utf utf-32

asked Mar 13 '11 at 20:28

dsimcha

votes

6 answers

Do UTF-8, UTF-16, and UTF-32 differ in the number of characters they can store?

Okay. I know this looks like the typical "Why didn't he just Google it or go to www.unicode.org and look it up?" question, but for such a simple question the answer still eludes me after checking both sources. I am pretty sure that all three of…

unicode character-encoding utf

asked Sep 24 '08 at 22:51

JohnFx

votes

5 answers

ISO-8859-1 vs UTF-8?

What should be used, and when? Is it always better to use UTF-8, or does ISO-8859-1 still have importance in specific conditions? Is the character set related to geographic region? Edit: Is there any benefit to putting this code @charset "utf-8"; or…

css xhtml unicode utf

asked Dec 12 '09 at 16:43

Jitendra Vyas

votes

3 answers

Enter Unicode characters with 8-digit hex code

How do I enter Unicode characters like 𝓭 without copying it to the clipboard and pasting it? Things I know: The command ga on the character gives me hex:0001d4ed. I can copy it on the clipboard and paste it via "+p. I know how to enter Unicode…

vim unicode utf-16 utf

asked Feb 02 '12 at 20:39

epsilonhalbe

votes

2 answers

Is there a field in which PDF files specify their encoding?

I understand that it is impossible to determine the character encoding of any stringform data just by looking at the data. This is not my question. My question is: Is there a field in a PDF file where, by convention, the encoding scheme is…

pdf unicode utf

asked May 18 '12 at 16:14

Louis Thibault

votes

4 answers

Is there a way in ruby 1.9 to remove invalid byte sequences from strings?

Suppose you have a string like "€foo\xA0", encoded UTF-8, Is there a way to remove invalid byte sequences from this string? ( so you get "€foo" ) In ruby-1.8 you could use Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "€foo\xA0") but that is now deprecated.…

ruby encoding character-encoding ruby-1.9 utf

asked Jan 03 '12 at 09:57

StefanH

votes

2 answers

What is the difference between UTF-32 and UCS-4?

What is the difference between UTF-32 and UCS-4? Isn't UTF-32 supposed to be a fixed-width encoding?

string unicode encoding char utf

asked May 12 '15 at 09:18

Virus721
