Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It aims to support every character needed for written text across all writing systems, including technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
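As a minimal sketch of the mapping listed above, Python's built-in ord() and chr() convert between a character and its code point. The helper name code_point_label is hypothetical and only used here for illustration.

```python
def code_point_label(ch: str) -> str:
    """Return the code point of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

# The same characters as in the list above; the Greek letters are written
# as escapes so the source stays unambiguous.
for ch in ["A", "B", "C", "\u039B", "\u039C"]:
    print(code_point_label(ch), ch)

# chr() goes the other way, from code point to character.
print(chr(0x0041))
```

Note that ord() accepts exactly one code point, so a string made of several code points has to be labeled one element at a time.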

Unicode Transformation Formats

UTFs describe how to encode code points as byte representations. The most common forms are UTF-8 (which encodes code points as a sequence of one, two, three or four bytes) and UTF-16 (which encodes code points as two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
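The table above can be reproduced with a short sketch, assuming Python 3 and its standard "utf-8" and "utf-16-be" codecs. The helper hex_bytes is a hypothetical name, and U+1F600 is added here only to illustrate a code point that needs four bytes in both encodings.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return the resulting bytes as spaced hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))

# One ASCII letter, one Greek letter, and one code point outside the
# Basic Multilingual Plane (not in the original table).
for ch in ["A", "\u039B", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", hex_bytes(ch, "utf-8"), hex_bytes(ch, "utf-16-be"))
```

The "utf-16-be" codec is used rather than plain "utf-16" because the latter prepends a byte order mark, which would not match the big-endian column of the table.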

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
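To make normalization and case rules a little more concrete, here is a minimal sketch using Python's standard unicodedata module and the built-in str.casefold(). The helper same_text is a hypothetical name. Sorting with the Unicode Collation Algorithm is not part of the Python standard library, so it is not shown here.

```python
import unicodedata

composed = "\u00E9"      # "é" as one precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The two spellings render alike but are different code point sequences.
print(composed == decomposed)
print(same_text(composed, decomposed))

# Case mapping is not always one-to-one: the German sharp s changes length.
print("Stra\u00DFe".casefold())
print("\u00DF".upper())
```

Comparing normalized forms is usually safer than comparing raw strings when the text comes from different sources, since each source may prefer a different but equivalent spelling.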

Identifying Characters

For more general information, see the Unicode article on Wikipedia.

23459 questions
1380
votes
31 answers

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

I'm having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup. The problem is that the error is not always reproducible; it sometimes works with some pages, and…
Homunculus Reticulli
1377
votes
7 answers

Why is executing Java code in comments with certain Unicode characters allowed?

The following code produces the output "Hello World!" (no really, try it). public static void main(String... args) { // The comment below is not a typo. // \u000d System.out.println("Hello World!"); } The reason for this is that the Java…
Reg
1298
votes
19 answers

What characters can be used for up/down triangle (arrow without stem) for display in HTML?

I'm looking for an HTML or ASCII character which is a triangle pointing up or down so that I can use it as a toggle switch. I found ↑ (&uarr;), and ↓ (&darr;) - but those have a narrow stem. I'm looking just for the HTML arrow "head".
Timj
1171
votes
8 answers

What's the difference between utf8_general_ci and utf8_unicode_ci?

Between utf8_general_ci and utf8_unicode_ci, are there any differences in terms of performance?
KahWee Teng
1000
votes
9 answers

What does the 'b' character do in front of a string literal?

Apparently, the following is the valid syntax: my_string = b'The string' I would like to know: What does this b character in front of the string mean? What are the effects of using it? What are appropriate situations to use it? I found a related…
Jesse Webb
889
votes
21 answers

What's the difference between UTF-8 and UTF-8 without BOM?

What's different between UTF-8 and UTF-8 without a BOM? Which is better?
simple
810
votes
12 answers

std::wstring VS std::string

I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have got the following questions: When should I use std::wstring over std::string? Can…
Appu
719
votes
7 answers

What exactly do "u" and "r" string flags do, and what are raw string literals?

While asking this question, I realized I didn't know much about raw strings. For somebody claiming to be a Django trainer, this sucks. I know what an encoding is, and I know what u'' alone does since I get what is Unicode. But what does r'' do…
e-satis
711
votes
2 answers

How does Zalgo text work?

I've seen weirdly formatted text called Zalgo like below written on various forums. It's kind of annoying to look at, but it really bothers me because it undermines my notion of what a character is supposed to be. My understanding is that a…
Mike
699
votes
15 answers

How do I see what character set a MySQL database / table / column is?

What is the (default) charset for: MySQL database MySQL table MySQL column
Rory
690
votes
10 answers

UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error: Traceback (most recent call last): File "SCRIPT LOCATION", line NUMBER,…
Eden Crow
597
votes
15 answers

Twitter image encoding challenge

If a picture's worth 1000 words, how much of a picture can you fit in 140 characters? Note: That's it folks! Bounty deadline is here, and after some tough deliberation, I have decided that Boojum's entry just barely edged out Sam Hocevar's. I will…
Brian Campbell
590
votes
12 answers

Saving utf-8 texts with json.dumps as UTF8, not as \u escape sequence

Sample code: >>> import json >>> json_string = json.dumps("ברי צקלה") >>> print(json_string) "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4" The problem: it's not human readable. My (smart) users want to verify or even edit text files with JSON dumps…
Berry Tsakala
585
votes
10 answers

What is the best way to remove accents (normalize) in a Python unicode string?

I have a Unicode string in Python, and I would like to remove all the accents (diacritics). I found on the web an elegant way to do this (in Java): convert the Unicode string to its long normalized form (with a separate character for letters and…
MiniQuark
574
votes
15 answers

What is the difference between UTF-8 and Unicode?

I have heard conflicting opinions from people - according to the Wikipedia UTF-8 page. They are the same thing, aren't they? Can someone clarify?
sarsnake