Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It aims to cover every character needed for written text across all writing systems, including technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
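
As a rough illustration of the mapping above, this Python sketch prints the code point of each listed character in U+XXXX notation. The helper name code_point_label is made up for this example; ord and chr are Python built-ins.

```python
# Hypothetical helper: format a character's code point as U+XXXX.
def code_point_label(ch: str) -> str:
    return f"U+{ord(ch):04X}"

# The characters from the list above; the Greek letters are written
# as escapes so the source file itself stays ASCII.
for ch in "ABC\u039b\u039c":
    print(code_point_label(ch), ch)

# chr() goes the other way, from code point to character.
print(chr(0x0041))  # prints "A"
```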

Unicode Transformation Formats

UTFs describe how to encode code points as byte sequences. The most common forms are UTF-8, which encodes each code point as one to four bytes, and UTF-16, which encodes each code point as one or two 16-bit code units (two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
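
The rows of the table can be reproduced with Python's built-in codecs. This is a minimal sketch, not a reference implementation: hex_bytes is a helper name invented here, and U+1D49C is added only to show a four-byte case in both encodings.

```python
# Hypothetical helper: render bytes as space-separated uppercase hex.
def hex_bytes(data: bytes) -> str:
    return " ".join(f"{b:02X}" for b in data)

# Two code points from the table plus one outside the Basic Multilingual Plane.
for cp in (0x0041, 0x039B, 0x1D49C):
    ch = chr(cp)
    utf8 = hex_bytes(ch.encode("utf-8"))
    utf16 = hex_bytes(ch.encode("utf-16-be"))
    print(f"U+{cp:04X}  {utf8:<12} {utf16}")
```

The last line of output shows U+1D49C taking four bytes in UTF-8 and a surrogate pair (also four bytes) in UTF-16.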

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
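
To give a feel for two of these operations, here is a small Python sketch of normalization and case folding using the standard unicodedata module and str.casefold. The function name nfc_equal and the sample strings are assumptions made for illustration only.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# Hypothetical helper: compare two strings after NFC normalization.
def nfc_equal(a: str, b: str) -> bool:
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)           # the raw code point sequences differ
print(nfc_equal(composed, decomposed))  # equal once both are normalized

# Case folding is meant for caseless comparison and handles
# characters such as the German sharp s.
print("Stra\u00dfe".casefold() == "STRASSE".casefold())
```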

Identifying Characters

For more general information, see the Unicode article on Wikipedia.

23459 questions
11 votes, 2 answers

Serializing supplementary unicode characters into XML documents with Java

I am trying to serialize DOM documents with supplementary unicode characters such as U+1D49C (𝒜, mathematical script capital A). Creating a node with such a character is not a problem (I just set the node value to the UTF-16 equivalent,…
Damien
11 votes, 2 answers

Dictionary with keys in unicode

Is it possible in Python to use Unicode characters as keys for a dictionary? I have Cyrillic words in Unicode that I used as keys. When trying to get a value by a key, I get the following traceback: Traceback (most recent call last): File…
KoirN
11 votes, 2 answers

String#encode not fixing "invalid byte sequence in UTF-8" error

I know there are multiple similar questions about this error, and I've tried many of them without luck. The problem I'm having involves the byte \xA1 and is throwing ArgumentError: invalid byte sequence in UTF-8 I've tried the following with no…
joshm1
11 votes, 1 answer

Convert unicode string to byte string

I get a string from a function that is represented like u'\xd0\xbc\xd0\xb0\xd1\x80\xd0\xba\xd0\xb0', but to process it I need it to be bytestring (like '\xd0\xbc\xd0\xb0\xd1\x80\xd0\xba\xd0\xb0'). How do I convert it without changes? My best guess…
Alexander Egurnov
11 votes, 2 answers

Python: Creating a Unicode string

I have a problem in Python with Unicode. I need to plot a graph with Unicode annotations in it. According to the tutorial I should just create my string in Unicode. I do it like this: annotation = u"%s has %s rev"%(art.title, len(art.revisions)) It is…
ashim
10 votes, 5 answers

Cross-platform C++: Use the native string encoding or standardise across platforms?

We are specifically eyeing Windows and Linux development, and have come up with two differing approaches that both seem to have their merits. The natural Unicode string type on Windows is UTF-16, and UTF-8 on Linux. We can't decide whether the…
Jesse Pepper
10 votes, 2 answers

What does the expression \X match when inside a RegEx?

According to http://www.regular-expressions.info, You can consider \X the Unicode version of the dot in regex engines that use plain ASCII. Does this mean that it will match any possible Unicode code point?
federico-t
10 votes, 1 answer

Django coercing to Unicode: need string or buffer, datetime.date found

I have a model: class MyModel(models.Model): id = models.IntegerField(primary_key=True) recorded_on = models.DateField() precipitation = models.FloatField(null=True, blank=True) in my views I have a query thus: import datetime def…
Darwin Tech
10 votes, 3 answers

Not able to send UTF-8 email using delphi indy

Here is my code Email body has got some unicode characters LSMTP := TIdSMTP.Create(nil); try LMsg := TIdMessage.Create(LSMTP); try with LMsg do begin Subject := Subj; Recipients.EMailAddresses := Email; …
ETL Man
10 votes, 2 answers

How do I convert a byte array to a string?

I have a byte that is an array of 30 bytes, but when I use BitConverter.ToString it displays the hex string. The byte is 0x42007200650061006B0069006E00670041007700650073006F006D0065. Which is in Unicode as well. It means…
Ian Lundberg
10 votes, 3 answers

Eclipse CDT: 'can't find a source file' while debugging

I'm using Eclipse with CDT for C++ development. However, I'm forced to use only ASCII characters in paths to my source files to successfully debug my programs. When source files are located in folders with non-English characters in their names, Eclipse gives…
Igor Shalyminov
10 votes, 3 answers

Correct use of string storage in C and C++

Popular software developers and companies (Joel Spolsky, Fog Creek software) tend to use wchar_t for Unicode character storage when writing C or C++ code. When and how should one use char and wchar_t in respect to good coding practices? I am…
user1254893
10 votes, 3 answers

python byte string encode and decode

I am trying to convert an incoming byte string that contains non-ASCII characters into a valid UTF-8 string so that I can dump it as JSON. b = '\x80' u8 = b.encode('utf-8') j = json.dumps(u8) I expected j to be '\xc2\x80' but instead I…
kung-foo
10 votes, 3 answers

Would std::basic_string<TCHAR> be preferable to std::wstring on Windows?

As I understand it, Windows #defines TCHAR as the correct character type for your application based on the build - so it is wchar_t in UNICODE builds and char otherwise. Because of this I wondered if std::basic_string<TCHAR> would be preferable to…
Matt Ryan
10 votes, 2 answers

"Delphi Fundamentals" in Delphi 2009

Has anybody used/converted "Delphi Fundamentals" in Delphi 2009? - http://fundementals.sourceforge.net/ I'm using Dictionaries (cArrays.pas, cDictionaries.pas, cStrings.pas, cTypes.pas) in my project and now I have some trouble upgrading the code. I'll…
J K Kunil