Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract code points and their interactions. It also defines multiple encodings for storing and exchanging those code points. All of them can express every valid Unicode code point, but they differ in size, compatibility, efficiency, and how much invalid data they can represent.
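
To make the size differences concrete, here is a minimal Python sketch that encodes a few sample strings and counts the bytes. The sample strings and the helper name `encoded_sizes` are illustrative assumptions, not part of the tag description; the codec names are the ones shipped with Python's standard library.

```python
# Compare how many bytes the same text needs in several Unicode encodings.
# The "-le" codecs are used so that no BOM is added to the byte counts.
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le"]

# Hypothetical sample strings chosen to cover different code point ranges.
SAMPLES = {
    "ascii": "hello",
    "greek": "\u03c0",        # GREEK SMALL LETTER PI
    "cjk": "\u4f60",          # a CJK ideograph in the Basic Multilingual Plane
    "emoji": "\U0001F600",    # a code point outside the Basic Multilingual Plane
}


def encoded_sizes(text):
    """Return {encoding name: number of bytes} for the given text."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


for label, text in SAMPLES.items():
    print(label, encoded_sizes(text))
```

In this sketch UTF-8 is smallest for ASCII text, UTF-16 is smallest for the CJK sample, and all three need four bytes for the code point outside the Basic Multilingual Plane.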

  • UTF-8 (people sometimes write just "UTF" for this encoding) can encode all valid and invalid sequences of the other encodings and is an ASCII superset. Unless there is a compelling compatibility constraint, this encoding is preferred.
  • Punycode is used only for internationalized domain names (historical contenders were UTF-5 and UTF-6).
  • GB18030 is the official Chinese encoding.
  • UTF-EBCDIC was meant to fill the role of UTF-8 on EBCDIC systems but never caught on.
  • UTF-7 was designed for systems that are not 8-bit clean, such as old email, but never gained much popularity even there.
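
A few of the properties in the list above can be checked with Python's built-in codecs. This is only a sketch: the helper `is_ascii_compatible` and the sample strings are assumptions made for illustration, and the `punycode`, `idna`, `utf-7` and `gb18030` codec names are the ones provided by the Python standard library.

```python
def is_ascii_compatible(text):
    """True if pure-ASCII text has identical bytes in ASCII and UTF-8."""
    return text.encode("ascii") == text.encode("utf-8")


# ASCII superset: ASCII text is byte-for-byte the same in UTF-8.
print(is_ascii_compatible("plain ASCII text"))

# A non-ASCII code point becomes a multi-byte sequence in UTF-8.
utf8_pi = "\u03c0".encode("utf-8")            # b'\xcf\x80'

# Punycode as used in internationalized domain names.
puny = "b\u00fccher".encode("punycode")        # b'bcher-kva'
idna_name = "b\u00fccher.example".encode("idna")  # b'xn--bcher-kva.example'

# UTF-7 output stays within 7-bit ASCII.
utf7_pi = "\u03c0".encode("utf-7")

# GB18030 can round-trip the same code points as the UTF encodings.
gb_text = "\u4f60\U0001F600".encode("gb18030")

print(utf8_pi, puny, idna_name, utf7_pi, gb_text)
```

Every value printed is a `bytes` object, so the output shows the raw encoded form rather than the characters themselves.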

The following encodings each have three variants: big-endian, little-endian, and either byte order announced by a BOM.

  • UTF-16: Early adopters from the time when 64k code points were thought to be enough moved to this encoding. Apart from orphaned surrogates, it cannot represent invalid UTF-8 or UTF-32 sequences. It is also rarely more space-efficient than UTF-8, and it is not fixed width (strictly speaking, not even UTF-32 is).
  • UTF-32 (identical to UCS-4): This encoding uses one code unit per code point. Because combining code points negate that already questionable benefit, and because of its large storage demand, it is seldom used even for internal representation.
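
The byte-order variants, surrogate pairs and combining code points mentioned above can be illustrated with a short Python sketch. The sample text and variable names are assumptions chosen for the example; the codecs and the `codecs.BOM_UTF16_*` constants come from the standard library.

```python
import codecs
import unicodedata

# 'a' followed by U+1F600, which lies outside the Basic Multilingual Plane.
text = "a\U0001F600"

# Explicit byte orders: no BOM is written.
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")

# The generic codec prepends a BOM and uses the platform's native byte order.
with_bom = text.encode("utf-16")
bom = with_bom[:2]
print(bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# UTF-16 is variable width: U+1F600 needs a surrogate pair (two 16-bit units).
utf16_units = len(le) // 2                          # 3 code units, 2 code points
utf32_units = len(text.encode("utf-32-le")) // 4    # 2 code units, 2 code points

# An orphaned surrogate is rejected by Python's strict UTF-16 codec.
try:
    "\ud800".encode("utf-16-le")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True

# Even in UTF-32, one code unit is not one user-perceived character:
# 'e' + COMBINING ACUTE ACCENT is two code points but displays as one glyph.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

print(utf16_units, utf32_units, lone_surrogate_rejected,
      len(decomposed), len(composed))
```

The last two numbers printed are the code point counts before and after NFC normalization, which is why a fixed-width encoding alone does not give fixed-width characters.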

800 questions
10 votes, 2 answers

Content is not allowed in prolog

I'm trying to convert XML to HTML using XSLT. I am using javax.xml.transform to do this in Java. It was working fine until I bumped into some XML. It reported the following error: [Fatal Error] :1:1: Content is not allowed in prolog. …
Senthil Kumar
  • 8,157
  • 7
  • 33
  • 44
10 votes, 2 answers

What is the most correct way to set the encoding in C++?

What is the best way to set the encoding in C++? I am used to working with Unicode (and wchar_t, wstring, wcin, wcout and L" ... "). I also save source in UTF-8. At the moment I use MinGW (Windows 7) and run my program in the Windows console…
shau-kote
  • 927
  • 2
  • 11
  • 21
9 votes, 2 answers

Difference between readAsBinaryString and readAsText using FileReader

So as an example, when I read the π character (\u03C0) from a File using the FileReader API, I get the pi character back to me when I read it using FileReader.readAsText(blob) which is expected. But when I use FileReader.readAsBinaryString(blob), I…
gengkev
  • 1,750
  • 1
  • 18
  • 28
9 votes, 3 answers

Is there any reason not to use UTF-8, 16, etc. for everything?

I know the web is mostly standardizing towards UTF-8 lately and I was just wondering if there was any place where using UTF-8 would be a bad thing. I've heard the argument that UTF-8, 16, etc may use more space but in the end it has been…
Joe Phillips
  • 44,686
  • 25
  • 93
  • 148
9 votes, 2 answers

PDFBox U+00A0 is not available in this font's encoding

I am facing a problem when invoking the setValue method of a PDField with a value that contains special characters: field.setValue("TEST-BY  (TEST)"). In detail, if my value contains characters such as U+00A0, I am getting the following…
assuna
  • 115
  • 1
  • 7
9 votes, 2 answers

Python psycopg2 not in utf-8

I use Python to connect to my PostgreSQL database like this: conn=psycopg2.connect(database="fedour", user="fedpur", password="***", host="127.0.0.1", port="5432") No problem there. But when I make a query and I want to print the cursor I have…
Fedour
  • 175
  • 1
  • 1
  • 12
9 votes, 4 answers

SQL doesn't differentiate u and ü although collation is utf8mb4_unicode_ci

In a table x, there is a column with the values u and ü. SELECT * FROM x WHERE column='u'. This returns u AND ü, although I am only looking for the u. The table's collation is utf8mb4_unicode_ci. Wherever I read about similar problems, everyone…
Jakob
  • 91
  • 4
9 votes, 3 answers

UTF conversion functions in C++11

I'm looking for a collection of functions for performing UTF character conversion in C++11. It should include conversion to and from any of utf8, utf16, and utf32. A function for recognizing byte order marks would be helpful, too.
Brent
  • 3,489
  • 3
  • 22
  • 55
9 votes, 3 answers

UTF-8 encoding: only some Japanese characters are not getting converted

I am getting the parameter value from the Jersey web service, and it is in Japanese characters. Here, 'japaneseString' is the web service parameter containing the Japanese characters. String name = new…
Janak
  • 4,630
  • 4
  • 24
  • 42
9 votes, 2 answers

Difference between UTF encodings?

I have a simple question - what is the difference between UTF-8, UTF-16 and UTF-32? I know that encoded strings have different sizes, but what are UTF-16 and UTF-32 for? Shouldn't UTF-8 be able to handle all languages correctly? And how does UTF-7…
Petr Mensik
  • 24,455
  • 13
  • 84
  • 111
8 votes, 2 answers

UTF Encoding for Chinese Characters in Java

I am receiving a String via an object from an axis webservice. Because I'm not getting the string I expected, I did a check by converting the string into bytes and I get C3A4C2 BDC2A0 C3A5C2 A5C2BD C3A5C2 90C297 in hexa, when I'm expecting E4BDA0…
Maurice
  • 6,273
  • 13
  • 49
  • 75
8 votes, 4 answers

jsp utf encoding

I'm having a hard time figuring out how to handle this problem: I'm developing a web tool for an Italian university, and I have to display words with accents (such as è, ù, ...); sometimes I get these words from a PostgreSQL table (UTF8-encoded),…
nicolamontecchio
  • 103
  • 1
  • 1
  • 6
8 votes, 1 answer

Git can't diff or merge .cs file in UTF-16 encoding

A friend and I were working on the same .cs file at the same time, and when there's a merge conflict Git points out that there's a conflict, but the file isn't loaded with the usual "HEAD" ">>>" stuff because the .cs files were binary files. So we added…
user1879789
  • 292
  • 3
  • 8
8 votes, 2 answers

Why is sys.getdefaultencoding() different from sys.stdout.encoding and how does this break Unicode strings?

I spent a few angry hours looking for the problem with Unicode strings that was broken down to something that Python (2.7) hides from me and I still don't understand. First, I tried to use u".." strings consistently in my code, but that resulted in…
Aleksandar Savkov
  • 2,634
  • 3
  • 20
  • 30
7 votes, 1 answer

Char to UTF code in vbscript

I'd like to create a .properties file to be used in a Java program from a VBScript. I'm going to use some strings in languages that use characters outside the ASCII range. So, I need to replace these characters with their UTF codes. This would be \u0061…
Carlos Blanco
  • 8,092
  • 15
  • 63
  • 97