Two words with the same representation in UTF-8 have different representations in ANSI

Question

I have a Farsi word that looks like this when displayed as UTF-8:

"خطاب"

I have two versions of this word. In Notepad++ both are displayed as above when viewed as UTF-8, but when I view them in ANSI mode, for the first one I see:

ïºïºŽï»„ïº§

and for the other one I see:

Ø®Ø·Ø§Ø¨

How come the same word has such different representations in ANSI? When I use PIL in Python to draw them, the result is correct for one and incorrect for the other.

I appreciate any help on this.

It [depends on your system settings](http://stackoverflow.com/a/701920/847349). ANSI might not include the Farsi code page — Dmitry Ledentsov, Dec 12 '13 at 06:41
If you are interpreting a UTF-8 encoded file in an ANSI encoding, of course you'll see garbage characters. It's not about them "having different representations", it's about interpreting a file in an incorrect encoding. See [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](http://kunststube.net/encoding/). — deceze, Dec 12 '13 at 08:02
@deceze thank you very much for the link, I will read it for sure. However, even if I should expect garbage in ANSI encoding, shouldn't both of them show the same garbage? — TJ1, Dec 12 '13 at 14:03

jedivader · Answer 1 · 2014-03-03T01:03:54.453

In Unicode, the same character can sometimes be represented in more than one way. In this case, the Arabic characters are represented with code points from the Arabic Presentation Forms-B Block in the first case, and with code points from the regular Arabic Block in the second case.
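As a rough illustration, Python's standard `unicodedata` module can list the character names behind both spellings. The two string literals are the code points worked out in this answer; the helper name `describe` is hypothetical and only there for readability.

```python
import unicodedata

# Code points taken from the two versions discussed in this answer.
presentation_forms = "\uFE8F\uFE8E\uFEC4\uFEA7"  # Arabic Presentation Forms-B
regular_letters = "\u062E\u0637\u0627\u0628"     # regular Arabic block


def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for label, text in [("presentation", presentation_forms),
                    ("regular", regular_letters)]:
    print(label)
    for code_point, name in describe(text):
        print("  ", code_point, name)

# NFKC normalization folds each presentation form back to its base letter.
folded = unicodedata.normalize("NFKC", presentation_forms)
print([f"U+{ord(ch):04X}" for ch in folded])
```

The folded letters come out in the reverse order of the regular string, which suggests the presentation-form version is stored in visual (left-to-right) order. That may be related to why only one version draws correctly in PIL, but it is a guess rather than something the question confirms.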

If you convert the text

ïºïºŽï»„ïº§

to a byte stream, you get

EFBA8F EFBA8E EFBB84 EFBAA7

Notice that the text above shows no character for the 8F byte, because that byte has no printable character assigned in the ANSI (Windows-1252) code page.

That byte stream represents UTF-8-encoded text. Decoding it gives you the following Unicode code points:

FE8F FE8E FEC4 FEA7
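The decoding step can be checked with a short Python sketch. It assumes the first byte triple is EF BA 8F, which is the valid UTF-8 encoding of U+FE8F; the variable names are illustrative.

```python
# Bytes of the first version, written as hex (whitespace is ignored).
raw = bytes.fromhex("EFBA8F EFBA8E EFBB84 EFBAA7")

# Interpret the bytes as UTF-8 and list the resulting code points.
text = raw.decode("utf-8")
code_points = [f"{ord(ch):04X}" for ch in text]
print(code_points)  # ['FE8F', 'FE8E', 'FEC4', 'FEA7']
```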

You can match those in the Arabic Presentation Forms-B Block to form your Farsi text:

خطاب

You can repeat the process for the other text: Ø®Ø·Ø§Ø¨ corresponds to the byte stream D8AE D8B7 D8A7 D8A8. Decoded as UTF-8, that gives the Unicode code points 062E 0637 0627 0628, which belong to the regular Arabic Block and again form the text خطاب.
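The whole round trip for this second version can be reproduced in Python. This is a minimal sketch that assumes "ANSI" on the asker's machine means Windows-1252 (Python codec name `cp1252`), which matches the garbled characters shown in the question.

```python
regular = "\u062E\u0637\u0627\u0628"  # the regular Arabic block spelling

# Encode to UTF-8 and show the bytes as hex.
raw = regular.encode("utf-8")
print(raw.hex().upper())  # D8AED8B7D8A7D8A8

# Viewing UTF-8 bytes as Windows-1252 yields the garbled text from the question.
mojibake = raw.decode("cp1252")
print(mojibake)

# Reversing the mistake recovers the original text.
recovered = mojibake.encode("cp1252").decode("utf-8")
print(recovered == regular)  # True
```

The presentation-form version cannot be round-tripped the same way in Python, because its first character encodes to a sequence ending in the byte 8F, which Python's `cp1252` codec leaves undefined and rejects with a `UnicodeDecodeError`.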
