Unicode Transformation Format (8/16/32/...): encodings used to serialize Unicode code points
Unicode defines abstract CodePoints and their interactions. It also defines multiple encodings for storing and exchanging those CodePoints. All of them can express every valid Unicode CodePoint, but they differ in size, compatibility, ability to represent invalid data, and efficiency.
- utf-8 (people sometimes write just UTF for this encoding) can encode all valid sequences as well as the invalid sequences of the other encodings, and is an ASCII superset. Unless there is a compelling compatibility constraint, this encoding is preferred.
- punycode is used only for internationalized domain names (historical contenders were utf-5 and utf-6).
- GB18030 is the official Chinese standard encoding; it covers all Unicode CodePoints.
- UTF-EBCDIC was meant to fill the role of utf-8 on EBCDIC systems but never caught on.
- utf-7 was designed for channels that are not 8-bit clean, such as old email transports, but never gained much popularity even there.
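A minimal sketch of the points above, using Python's standard codecs (the sample strings are arbitrary illustrations):

```python
# Sketch: comparing the encodings above with Python's stdlib codecs.
text = "héllo"

# utf-8 is an ASCII superset: pure-ASCII text encodes to the same bytes.
assert "hello".encode("utf-8") == "hello".encode("ascii") == b"hello"

# Non-ASCII CodePoints take multiple bytes in utf-8:
assert text.encode("utf-8") == b"h\xc3\xa9llo"

# GB18030 also round-trips arbitrary Unicode CodePoints:
assert text.encode("gb18030").decode("gb18030") == text

# punycode in its IDNA framing, as used for international domain names:
assert "bücher".encode("idna") == b"xn--bcher-kva"
```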
The following encodings come in three variants: big-endian, little-endian, and any-endian with a BOM (byte order mark).
- utf-16 (utf-16le) Early adopters who embraced ucs2, back when people thought 65536 CodePoints would be enough, moved to this encoding. Apart from orphaned surrogates, invalid utf-8 or utf-32 sequences cannot be represented in utf-16. It is also rarely more space-efficient than utf-8, nor is it fixed-width (not even utf-32 really is).
- utf-32 (identical to ucs4, aka modern ucs) This is the one-CodeUnit-per-CodePoint encoding. Because combining CodePoints negate this already questionable benefit, and because of its huge storage demand, it is seldom used even for internal representation.
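The endianness/BOM variants, surrogate pairs, and the fixed-width caveat can be demonstrated the same way (the sample CodePoints are chosen for illustration):

```python
# Sketch: utf-16/utf-32 variants and why "fixed width" is misleading.
s = "\U0001D11E"  # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

# In utf-16 this one CodePoint needs a surrogate pair (two 16-bit CodeUnits):
assert s.encode("utf-16-le") == b"\x34\xd8\x1e\xdd"
assert len(s.encode("utf-16-le")) == 4

# The plain "utf-16" codec is the any-endian variant: it prepends a BOM.
bom_variant = s.encode("utf-16")
assert bom_variant.startswith(b"\xff\xfe") or bom_variant.startswith(b"\xfe\xff")

# utf-32 really is one CodeUnit per CodePoint, but combining CodePoints mean
# one user-perceived character can still span several CodePoints:
e_acute = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
assert len(e_acute) == 2                       # two CodePoints
assert len(e_acute.encode("utf-32-le")) == 8   # two 4-byte CodeUnits
```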