I have a Farsi word that looks like this when displayed as UTF-8:

"خطاب"

I have two versions of this word. In Notepad++ with UTF-8 encoding, both are displayed as above. But if I view them in ANSI mode, for the first one I see:

ïºïºŽï»„ﺧ

and for the other one I see:

Ø®Ø·Ø§Ø¨

How come the same word has such different representations in ANSI mode? When I use PIL in Python to draw these, the result is correct for one of them and incorrect for the other.

I appreciate any help on this.

TJ1
  • It [depends on your system settings](http://stackoverflow.com/a/701920/847349). ANSI might not include the Farsi code page – Dmitry Ledentsov Dec 12 '13 at 06:41
  • If you are interpreting a UTF-8 encoded file in an ANSI encoding, of course you'll see garbage characters. It's not about them "having different representations", it's about interpreting a file in an incorrect encoding. See [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](http://kunststube.net/encoding/). – deceze Dec 12 '13 at 08:02
  • @deceze thank you very much for the link, I will read it for sure. However, even if I should see garbage in ANSI encoding, shouldn't both of them show the same garbage? – TJ1 Dec 12 '13 at 14:03

1 Answer

In Unicode, some characters can be represented in more than one way. Here, the Arabic letters are encoded with code points from the Arabic Presentation Forms-B block in the first case, and with code points from the regular Arabic block in the second case.
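As a small illustration of "more than one way", the sketch below lists the four Presentation Forms-B code points for the letter khah and shows that NFKC normalization folds each of them back to the single regular-block code point U+062E. It uses only the standard `unicodedata` module; picking khah as the example letter is my own choice.

```python
import unicodedata

# The four positional shapes of khah in Arabic Presentation Forms-B:
# isolated, final, initial, medial.
forms = ["\uFEA5", "\uFEA6", "\uFEA7", "\uFEA8"]

for ch in forms:
    base = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> U+{ord(base):04X}")

# Every shape folds to the same regular-block letter.
folded = {unicodedata.normalize("NFKC", ch) for ch in forms}
```

So a renderer that expects regular-block letters and one that receives pre-shaped presentation forms are looking at different code points for the same letter.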

If you convert the text

ïºïºŽï»„ﺧ

to a byte stream, you get

EFBA8F EFBA8E EFBB84 EFBAA7

Notice that no character for the 8F byte appears in the text above. That byte has no printable character in the Windows-1252 ("ANSI") code page, so nothing is displayed for it.
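The ANSI view can be reproduced roughly as follows. This is a sketch that assumes "ANSI" means Windows-1252 (it depends on the system code page), and it starts from the bytes EF BA 8F EF BA 8E EF BB 84 EF BA A7, where EF BA 8F is the UTF-8 encoding of U+FE8F. Python's `cp1252` codec treats 0x8F as unassigned, so strict decoding fails on exactly that byte; `errors="replace"` substitutes U+FFFD where the editor shows nothing.

```python
raw = bytes.fromhex("EFBA8F EFBA8E EFBB84 EFBAA7")

# Strict decoding stops at the byte that cp1252 does not define.
bad_byte = None
try:
    raw.decode("cp1252")
except UnicodeDecodeError as exc:
    bad_byte = raw[exc.start]

# Lenient decoding: the undefined byte becomes U+FFFD, the rest is mojibake.
ansi_view = raw.decode("cp1252", errors="replace")
print(hex(bad_byte), ansi_view)
```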

That byte stream is UTF-8-encoded text. Decoding it gives the following Unicode code points:

FE8F FE8E FEC4 FEA7
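The decoding step can be checked in Python. A minimal sketch, using only built-ins; the first byte triple is written as EF BA 8F, which is what U+FE8F encodes to in UTF-8 (the last line confirms that round trip).

```python
raw = bytes.fromhex("EFBA8F EFBA8E EFBB84 EFBAA7")

# Interpret the bytes as UTF-8 and list the resulting code points.
text = raw.decode("utf-8")
code_points = [f"{ord(ch):04X}" for ch in text]
print(code_points)

# Round trip: encoding U+FE8F yields the first three bytes again.
first_triple = "\uFE8F".encode("utf-8").hex().upper()
print(first_triple)
```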

Looking those code points up in the Arabic Presentation Forms-B block gives the letters of your Farsi text:

خطاب

You can apply the same process to the other text. There, خطاب is stored as the byte stream D8AE D8B7 D8A7 D8A8, which is again UTF-8-encoded text. Decoding it gives the Unicode code points 062E 0637 0627 0628, and looking those up in the regular Arabic block gives the same text, خطاب.
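The same check for the second text, going in the forward direction, might look like this. It is a sketch that again assumes Windows-1252 for the ANSI view. The last step compares the two texts after NFKC normalization. With the code points listed in this answer, the first text folds to the same four letters in the opposite order, which suggests (but does not prove) that the first file stores pre-shaped letters in visual order.

```python
import unicodedata

regular = "\u062E\u0637\u0627\u0628"        # regular Arabic block
presentation = "\uFE8F\uFE8E\uFEC4\uFEA7"   # Presentation Forms-B, from above

# Encode to UTF-8 and print the bytes in two-byte groups.
raw = regular.encode("utf-8")
hex_view = " ".join(raw[i:i + 2].hex().upper() for i in range(0, len(raw), 2))
print(hex_view)

# What an editor shows when it misreads those bytes as Windows-1252.
ansi_view = raw.decode("cp1252")
print(ansi_view)

# Fold the presentation forms to base letters and compare with the regular text.
folded = unicodedata.normalize("NFKC", presentation)
same_letters_reversed = folded[::-1] == regular
print(same_letters_reversed)
```

If the two inputs need to compare equal, normalizing and fixing the order before drawing is one option, though whether that suits a given PIL setup is something to test rather than assume.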

jedivader