Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract code points and their interactions. It also defines multiple encodings for the storage and exchange of those code points. All of them can express every valid Unicode code point, though they differ in size, compatibility, efficiency, and ability to represent invalid data.

  • UTF-8 (people sometimes write just "UTF" for this encoding) is an ASCII superset. It can encode every valid sequence of the other encodings and, if validity checks are relaxed, their invalid ones too (such as lone surrogates). Unless there is a compelling compatibility constraint, this encoding is preferred.
  • Punycode is used only for internationalized domain names (historical contenders were UTF-5 and UTF-6).
  • GB18030 is the official Chinese encoding.
  • UTF-EBCDIC was meant to fill the role of UTF-8 on EBCDIC systems but never caught on.
  • UTF-7 was designed for systems that are not 8-bit clean, such as old email, but never gained much popularity even there.
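
The properties listed above can be checked with a short Python sketch. It uses only codecs that ship with the standard library; the sample strings are assumptions chosen for illustration and do not come from the tag wiki.

```python
# Sample strings are illustrative assumptions.
ascii_text = "plain ASCII"
mixed_text = "naïve €"

# UTF-8 is an ASCII superset: pure-ASCII text encodes to identical bytes.
same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

utf8_bytes = mixed_text.encode("utf-8")    # 'ï' takes 2 bytes, '€' takes 3
utf7_bytes = mixed_text.encode("utf-7")    # output stays within 7-bit ASCII
gb_bytes = mixed_text.encode("gb18030")    # GB18030 also covers all of Unicode
idn_label = "bücher".encode("punycode")    # Punycode form of one domain label

print(same_as_ascii, len(utf8_bytes))
print(utf7_bytes, idn_label)
```

Real internationalized domain names add an `xn--` prefix and a normalization step on top of the raw Punycode label; the sketch shows only the encoding itself.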

The following encodings each have three variants: big-endian, little-endian, and either byte order signaled by a BOM.

  • UTF-16 (UCS-2): Early adopters who embraced UCS-2, back when 64K code points were thought to be enough, later moved to this encoding. Apart from orphaned surrogates, it cannot represent invalid UTF-8 or UTF-32 sequences. It is also rarely more space-efficient than UTF-8, and it is not fixed-width (strictly speaking, not even UTF-32 is).
  • UTF-32 (identical to UCS-4): This is the encoding with one code unit per code point. Because combining code points negate this one questionable benefit, and because of its huge storage demand, it is seldom used even for internal representation.
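
As a hedged illustration of the byte-order variants and of why neither encoding is truly fixed-width, the following Python sketch encodes one code point outside the Basic Multilingual Plane and one combining sequence. The sample characters are assumptions picked for the example.

```python
import codecs
import unicodedata

emoji = "\U0001F600"  # one code point outside the Basic Multilingual Plane

utf16_be = emoji.encode("utf-16-be")  # surrogate pair D83D DE00, big-endian
utf16_le = emoji.encode("utf-16-le")  # same pair, bytes swapped per code unit
utf16_bom = emoji.encode("utf-16")    # BOM first, then platform byte order
utf32_le = emoji.encode("utf-32-le")  # always 4 bytes per code point

# One user-perceived character can still span several code points:
decomposed = "e\u0301"                # 'e' followed by a combining acute accent
composed = unicodedata.normalize("NFC", decomposed)

print(utf16_be.hex(), utf16_le.hex(), len(utf32_le))
print(utf16_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(len(decomposed), len(composed))
```

UTF-16 needs two code units for the emoji, and even in UTF-32 the decomposed "é" occupies two code units, so indexing by code unit does not give indexing by visible character in either encoding.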

800 questions
544 votes
13 answers

UTF-8, UTF-16, and UTF-32

What are the differences between UTF-8, UTF-16, and UTF-32? I understand that they will all store Unicode, and that each uses a different number of bytes to represent a character. Is there an advantage to choosing one over the other?
user60456
376 votes
2 answers

Unicode, UTF, ASCII, ANSI format differences

What is the difference between the Unicode, UTF8, UTF7, UTF16, UTF32, ASCII, and ANSI encodings? In what way are these helpful for programmers?
web dunia
141 votes
5 answers

Difference between UTF-8 and UTF-16?

Difference between UTF-8 and UTF-16? Why do we need these? MessageDigest md = MessageDigest.getInstance("SHA-256"); String text = "This is some text"; md.update(text.getBytes("UTF-8")); // Change this to "UTF-16" if needed byte[] digest =…
theJava
140 votes
15 answers

Which encoding opens CSV files correctly with Excel on both Mac and Windows?

We have a web app that exports CSV files containing foreign characters with UTF-8, no BOM. Both Windows and Mac users get garbage characters in Excel. I tried converting to UTF-8 with BOM; Excel/Win is fine with it, Excel/Mac shows gibberish. I'm…
Timm
88 votes
13 answers

<0xEF,0xBB,0xBF> character showing up in files. How to remove them?

I am compressing JavaScript files, and the compressor is complaining that my files have the <0xEF,0xBB,0xBF> character in them. How can I search for these characters and remove them?
Quintin Par
86 votes
1 answer

Unicode encoding for string literals in C++11

Following a related question, I'd like to ask about the new character and string literal types in C++11. It seems that we now have four sorts of characters and five sorts of string literals. The character types: char a = '\x30'; //…
Kerrek SB
85 votes
6 answers

How many characters can be mapped with Unicode?

I am asking for the count of all the possible valid combinations in Unicode with explanation. I know a char can be encoded as 1,2,3 or 4 bytes. I also don't understand why continuation bytes have restrictions even though starting byte of that char…
Ufuk Hacıoğulları
76 votes
9 answers

Android WebView with garbled UTF-8 characters.

I'm using some webviews in my Android app, but I am unable to make them display in UTF-8 encoding. If I use this one I won't see my Scandinavian characters: mWebView.loadUrl("file:///android_asset/om.html") And if I try this one, I won't get anything…
elwis
74 votes
5 answers

What's the point of UTF-16?

I've never understood the point of UTF-16 encoding. If you need to be able to treat strings as random access (i.e. a code point is the same as a code unit) then you need UTF-32, since UTF-16 is still variable length. If you don't need this, then…
dsimcha
52 votes
6 answers

Do UTF-8, UTF-16, and UTF-32 differ in the number of characters they can store?

Okay. I know this looks like the typical "Why didn't he just Google it or go to www.unicode.org and look it up?" question, but for such a simple question the answer still eludes me after checking both sources. I am pretty sure that all three of…
JohnFx
50 votes
5 answers

ISO-8859-1 vs UTF-8?

What should be used, and when? Is it always better to use UTF-8, or does ISO-8859-1 still matter in specific conditions? Is the character set related to geographic region? Edit: Is there any benefit to putting this code @charset "utf-8"; or…
Jitendra Vyas
37 votes
3 answers

Enter Unicode characters with 8-digit hex code

How do I enter Unicode characters like 𝓭 without copying them to the clipboard and pasting them? Things I know: The command ga on the character gives me hex:0001d4ed. I can copy it on the clipboard and paste it via "+p. I know how to enter Unicode…
epsilonhalbe
28 votes
2 answers

Is there a field in which PDF files specify their encoding?

I understand that it is impossible to determine the character encoding of any stringform data just by looking at the data. This is not my question. My question is: Is there a field in a PDF file where, by convention, the encoding scheme is…
Louis Thibault
27 votes
4 answers

Is there a way in ruby 1.9 to remove invalid byte sequences from strings?

Suppose you have a string like "€foo\xA0", encoded UTF-8, Is there a way to remove invalid byte sequences from this string? ( so you get "€foo" ) In ruby-1.8 you could use Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "€foo\xA0") but that is now deprecated.…
StefanH
26 votes
2 answers

What is the difference between UTF-32 and UCS-4?

What is the difference between UTF-32 and UCS-4? Isn't UTF-32 supposed to be a fixed-width encoding?
Virus721