Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It aims to cover every character needed for written text in all writing systems, including technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
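As a rough illustration of the mapping above, the following Python sketch prints the code point and the official character name for the listed characters. The helper name `describe` is made up for this example; `ord` and `unicodedata.name` are standard-library calls.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    # ord() gives the code point as an integer; format it as 4+ hex digits.
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# A, B, C, then the two Greek capitals from the list above (written as escapes
# so they cannot be confused with look-alike Latin letters).
for ch in "ABC\u039b\u039c":
    print(describe(ch))
# first line printed: U+0041 LATIN CAPITAL LETTER A
```

The reverse direction is `chr(0x0041)`, which returns the character for a given code point.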

Unicode Transformation Formats

UTFs describe how to encode code points as sequences of bytes. The most common forms are UTF-8 (which encodes each code point as one to four bytes) and UTF-16 (which encodes each code point as two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
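The rows of the table can be reproduced with a short Python sketch. The helper `hex_bytes` is a name invented here, and the emoji U+1F600 is an extra sample (not in the table) chosen only to show a four-byte case. `bytes.hex` with a separator argument assumes Python 3.8 or later.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated upper-case hex."""
    return text.encode(encoding).hex(" ").upper()


# "A" and GREEK CAPITAL LETTER LAMDA come from the table above;
# U+1F600 is an added sample that lies outside the Basic Multilingual Plane.
for ch in ("A", "\u039b", "\U0001F600"):
    print(
        f"U+{ord(ch):04X}",
        hex_bytes(ch, "utf-8"),
        hex_bytes(ch, "utf-16-be"),
        sep="    ",
    )
```

For the last sample, UTF-8 produces four bytes and UTF-16 produces a surrogate pair, which is where the "or four bytes" case comes from.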

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
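Two of these operations, normalization and case mapping, can be sketched with Python's standard library. The helper `same_text` is hypothetical, and comparing strings after NFC normalization is only one possible policy. Note that Python's built-in `str` methods are not locale-aware, so locale-specific rules (such as the Turkish dotted and dotless i) would need an additional library.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00e9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print(composed == decomposed)           # the raw code point sequences differ
print(same_text(composed, decomposed))  # equal once both are normalized

# Case mapping is not always one character to one character:
# the German sharp s has a two-letter uppercase form.
print("stra\u00dfe".upper())     # STRASSE
print("stra\u00dfe".casefold())  # strasse
```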

Identifying Characters

For more general information, see the Unicode article on Wikipedia.

23459 questions
11 votes, 1 answer

Java REGEX code to validate Indian language characters not working?

Why is the following code not working (resulting in false) with Indian languages? System.out.println(Charset.forName("UTF-8").encode("అనువాద") …
Suren Raju
11 votes, 2 answers

Unicode Encoding and decoding issues in QRCode

I am trying to generate a UTF-8 QR code so that I can encode accents and Unicode characters. To test it, I am using several decoding solutions: http://zxing.org/w/decode.jspx - The zxing project also used in…
Natim
11 votes, 2 answers

How can one find the Unicode codepoints that a font has glyphs for, on a Debian-based system?

From a scripting language (Python or Ruby, say) on a Debian-based system, I would like to find either one of: (1) all the Unicode codepoints that a particular font has glyphs for; (2) all the fonts that have glyphs for a particular Unicode…
Mark Longair
11 votes, 1 answer

Issue about 65533 � in C# text file reading

I created a sample app to load all special characters when copy-pasting from OpenOffice Writer to Notepad. Double codes differ, and when I try to load this: var lines = File.ReadAllLines("..\\ter34.txt"); this creates the 65533 problem. Issue comes…
Aravind Srinivas
11 votes, 3 answers

unicode text file output differs between XE2 and Delphi 2009?

When I try the code below there seems to be different output in XE2 compared to D2009. procedure TForm1.Button1Click(Sender: TObject); var Outfile:textfile; myByte: Byte; begin assignfile(Outfile,'test_chinese.txt'); Rewrite(Outfile); for…
Thomas
11 votes, 4 answers

How to convert unicode accented characters to pure ascii without accents?

I'm trying to download some content from a dictionary site like http://dictionary.reference.com/browse/apple?s=t The problem I'm having is that the original paragraph has all those squiggly lines, and reverse letters, and such, so when I read the…
Wolf
11 votes, 2 answers

Japanese COBOL Code: rules for G literals and identifiers?

We are processing IBM Enterprise Japanese COBOL source code. The rules that describe exactly what is allowed in G type literals, and what is allowed for identifiers, are unclear. The IBM manual indicates that a G'....' literal must have a SHIFT-OUT…
Ira Baxter
11 votes, 4 answers

read/write unicode data in MySql

I am using a MySQL DB and want to be able to read and write Unicode data values, for example French/Greek/Hebrew values. My client program is C# (.NET Framework 3.5). How do I configure my DB to allow Unicode? And how do I use C# to read/write values…
user123093
11 votes, 1 answer

UnicodeEncodeError: 'ascii' codec can't encode characters

I have a dict that's fed with URL responses, like: >>> d { 0: {'data': u'found "\u62c9\u67cf \u591a\u516c \u56ed"'} 1: {'data': u'some other data'} ... } While using the xml.etree.ElementTree function on these data values (d[0]['data']) I…
theta
11 votes, 3 answers

Python 3: Demystifying encode and decode methods

Let's say I have a string in Python: >>> s = 'python' >>> len(s) 6 Now I encode this string like this: >>> b = s.encode('utf-8') >>> b16 = s.encode('utf-16') >>> b32 = s.encode('utf-32') What I get from above operations is a bytes array -- that…
treecoder
11 votes, 3 answers

UNICODE, UTF-8 and Windows mess

I'm trying to implement text support in Windows with the intention of also moving to a Linux platform later on. It would be ideal to support international languages in a uniform way but that doesn't seem to be easily accomplished when considering…
Murrgon
11 votes, 5 answers

Python3 convert Unicode String to int representation

As we all know, a computer works with numbers. I'm typing this text right now, the server makes a number out of it and when you want to read it, you'll get text from the server. How can I do this on my own? I want to encrypt something with my own…
user1703918
11 votes, 1 answer

In haskell how can I uppercase a unicode character with respect to current locale

It turns out that uppercasing a character is a complicated business. If you get out of the basic ASCII character set, the rules for uppercasing a character and lowercasing a character are actually dependent on the locale in which the application is…
Savanni D'Gerinel
11 votes, 5 answers

Shouldn't JSON.stringify escape Unicode characters?

I have a simple test page in UTF-8 where text with letters in multiple different languages gets stringified to JSON: http://jsfiddle.net/Mhgy5/ HTML: