Questions tagged [utf]

Unicode Transformation Format (8/16/32/...) used for encoding Unicode code points

Unicode defines abstract code points and their interactions. It also defines multiple encodings for storing and exchanging those code points. All of them can express every valid Unicode code point, though they differ in size, compatibility, expressiveness for invalid data, and efficiency.

  • UTF-8 (people sometimes write just "UTF" for this encoding) can encode all valid and invalid sequences in the other encodings and is an ASCII superset. If there is no compelling compatibility constraint, this encoding is preferred.
  • Punycode is used only for internationalized domain names. (Historical contenders were UTF-5 and UTF-6.)
  • GB18030 is the official Chinese encoding.
  • UTF-EBCDIC was meant to fill the role of UTF-8 on EBCDIC systems but never caught on.
  • UTF-7 was designed for systems that are not 8-bit clean, such as old email, but never gained much popularity even there.
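
As a rough illustration of the list above, the following sketch encodes one sample string with several codecs that ship with Python's standard library. The sample text and the variable names are chosen here for illustration only; UTF-EBCDIC is left out because the standard library has no codec for it.

```python
# Sample text with two non-ASCII letters, written with escapes so the
# source file's own encoding does not matter ("ñandú").
text = "\u00f1and\u00fa"

# Codec names as accepted by Python's standard library.
codec_names = ["utf-8", "utf-7", "punycode", "gb18030"]

# Encode the same text with every codec.
encoded = {name: text.encode(name) for name in codec_names}

for name, data in encoded.items():
    # Every codec must round-trip back to the original text.
    assert data.decode(name) == text
    print(f"{name:9} {len(data):2} bytes  {data!r}")

# UTF-8 is an ASCII superset: pure ASCII text is left byte-for-byte unchanged.
assert "abc".encode("utf-8") == b"abc"

# UTF-7 output stays within 7-bit bytes, which is what it was designed for.
assert all(byte < 0x80 for byte in encoded["utf-7"])
```

The byte counts differ per codec, which is the size trade-off the list describes.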

The following encodings have 3 variants: big-endian, little-endian and any-endian with BOM.

  • UTF-16: Early adopters who embraced UCS-2 when people thought 64K code points would be enough moved to this encoding. Apart from orphaned surrogates, invalid UTF-8 or UTF-32 sequences cannot be encoded as UTF-16. It is also rarely more space-efficient than UTF-8, and it is not fixed-width (not even UTF-32 really is).
  • UTF-32 (identical to UCS-4): This is the encoding with one code unit per code point. Because combining code points negate this one questionable benefit, and because of its huge storage demand, it is seldom used even for internal representation.
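
The byte-order variants and the width caveats can be seen in a small sketch using Python's built-in codecs. The sample characters are chosen here for illustration; `surrogatepass` is Python's error handler for letting lone surrogates through and is not part of the encodings themselves.

```python
# One code point outside the Basic Multilingual Plane (U+1F600).
ch = "\U0001F600"

utf16_be = ch.encode("utf-16-be")   # big-endian, no BOM
utf16_le = ch.encode("utf-16-le")   # little-endian, no BOM
utf16_bom = ch.encode("utf-16")     # platform byte order, prefixed with a BOM
utf32_be = ch.encode("utf-32-be")   # one 4-byte code unit

# UTF-16 needs a surrogate pair (two 16-bit code units) for this code point.
print(utf16_be.hex(), utf16_le.hex(), utf16_bom.hex(), utf32_be.hex())

# A base letter plus a combining accent: one perceived character,
# but two code points, so even UTF-32 spends two code units on it.
combined = "e\u0301"
print(len(combined), len(combined.encode("utf-32-be")))

# An orphaned (lone) high surrogate is rejected by the strict codec...
lone = "\ud83d"
try:
    lone.encode("utf-16-le")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...but can still be written out as a UTF-16 code unit on request.
lone_bytes = lone.encode("utf-16-le", "surrogatepass")
```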


800 questions
7
votes
4 answers

Spanish characters in Android Studio

I've got a problem with Android Studio. I'm trying to develop an application, but characters like "¿" or "ñ" and "á,é,ó,í,ú" don't appear correctly when I run the application. I've tried to solve the problem by changing the encoding to UTF-8 but…
Dv Apps
7
votes
4 answers

Do I need the supplementary planes?

I think the question is pretty simple: do I need all the rest of the stuff in Unicode after the basic plane? What kind of stuff is included, and is it really needed (and for what purposes)? Thanks.
Tower
6
votes
1 answer

MSBuild.exe output encoding

I use MSBuild.exe to build a solution on a machine with the Russian language set. But in the TeamCity build log, all Russian characters appear in the wrong encoding. How do I set up MSBuild.exe to output properly (in UTF-8, for example)?
Dmitriy Kudinov
6
votes
1 answer

python3: bytes vs bytearray, and converting to and from strings

I'd like to understand Python 3's bytes and bytearray classes. I've seen documentation on them, but not a comprehensive description of their differences and how they interact with string objects.
fearless_fool
6
votes
2 answers

Reading UTF-8 with BOM in ruby 2.5.0

Is there a way to read files encoded in UTF-8 with a BOM (byte order mark) on Ruby v2.5.0? On Ruby 2.3.1 this used to work: csv = CSV.open(file_path, encoding: 'bom|utf-8') However, on 2.5.0 the following error occurs: ArgumentError: unknown…
romeu.hcf
6
votes
3 answers

What is a surrogate pair?

I came across this code in a JavaScript open source project. validator.isLength = function (str, min, max) // match surrogate pairs in string or declare an empty array if none found in string var surrogatePairs =…
Noman Ur Rehman
6
votes
5 answers

UTF usage in C++ code

What is the difference between UTF and UCS? What are the best ways to represent non-European character sets (using UTF) in C++ strings? I would like to know your recommendations for: internal representation inside the code; string manipulation…
Martin York
6
votes
4 answers

PHP MySQL database strange characters

I'm trying to output product information stored in a MySQL database, but it's writing out some strange characters, like a diamond with a question mark inside of it. I think it may be an encoding/UTF8 issue, but I've specified the encoding I…
user231733
5
votes
1 answer

How are strings stored by Python in computers?

I believe most of you who are familiar with Python have read Dive Into Python 3. In chapter 4.3, it says this: In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python…
endless
5
votes
1 answer

Response.WriteFile() Strange characters issue

Hello in my aspx page using MVC 3, I have the following code: <%Response.WriteFile("/Content/Bing.htm"); %> Which is an include file that contains BING search box code. At the top of the containing DIV, a strange character is appearing:  I…
Cyberdrew
5
votes
3 answers

idn_to_ascii() in 5.2.17

There's a very handy function idn_to_ascii() in PHP 5.3, but I'm running 5.2.17 and I can't change that. How do I encode Unicode domain names to ascii then?
donk
5
votes
1 answer

What are surrogate characters in UTF-8?

I have a strange validation program that validates whether a UTF-8 string is a valid host name (Zend Framework Hostname validator in PHP). It allows IDNs (internationalized domain names). It will compare each subdomain with sets of characters defined…
Gherman
5
votes
2 answers

Syllabification of Devanagari

I am trying to syllabify Devanagari words: धर्मक्षेत्रे -> धर् मक् षेत् रे (dharmakeshetre -> dhar mak shet re). With wd.split('्') I get the result ['धर', 'मक', 'षेत', 'रे'], which is partially correct. I try another word: कुरुक्षेत्र -> कु रुक् षेत्…
Echchama Nayak
5
votes
3 answers

Persist UTF-8 as Default Encoding

I tried to persist UTF-8 as the default encoding in Python. I tried: >>> import sys >>> sys.getdefaultencoding() 'ascii' And I also tried: >>> import sys >>> reload(sys) >>> sys.setdefaultencoding('UTF8') >>>…
DenCowboy
5
votes
4 answers

Delphi: Encoding strings as Python does

I want to encode strings as Python does. The Python code is this: def EncodeToUTF(inputstr): uns = inputstr.decode('iso-8859-2') utfs = uns.encode('utf-8') return utfs This is very simple. But in Delphi I don't understand how to encode, to force…
durumdara