The most important thing when reading and writing plain text is to know, and explicitly specify, its encoding. Don't let Python guess the encoding for you, especially in a real-world program: the encoding should either be configurable or be requested from the user.
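One way to make the encoding configurable is a required command-line option. This is only a minimal sketch: the script layout, the `build_parser` and `read_text` helpers, and the `notes.txt` file name are all made up for illustration.

```python
import argparse

def build_parser():
    # Hypothetical CLI: the encoding is required, so nothing is left to guessing.
    parser = argparse.ArgumentParser(description="Print a text file.")
    parser.add_argument("path", help="file to read")
    parser.add_argument("--encoding", required=True,
                        help="encoding of the file, e.g. utf-8 or windows-1256")
    return parser

def read_text(path, encoding):
    # Pass the encoding through explicitly instead of relying on the platform default.
    with open(path, encoding=encoding) as f:
        return f.read()

# An explicit argument list is parsed here for demonstration;
# a real script would call parse_args() with no arguments.
args = build_parser().parse_args(["notes.txt", "--encoding", "windows-1256"])
print(args.path, args.encoding)
```

The script would then call `read_text(args.path, args.encoding)`, so the choice of encoding always comes from the person who knows the file.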
Many people never notice the issue with English text because ASCII is a subset of most encodings. The issue is still there, and they will run into it as soon as the program tries to read or write non-ASCII text in a different encoding.
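A quick way to see why English text hides the problem is to encode the same strings with several codecs. The sketch below uses the sample spelling "Tariq" as an assumed ASCII counterpart of the Arabic name; `ascii()` is used for printing so the output stays readable on any console.

```python
name_en = "Tariq"   # assumed ASCII transliteration, for illustration only
name_ar = "طارق"
encodings = ["utf-8", "windows-1256", "cp720", "iso-8859-6"]

# ASCII-only text: every one of these encodings produces identical bytes.
ascii_bytes = {enc: name_en.encode(enc) for enc in encodings}

# Arabic text: the bytes depend entirely on the encoding.
arabic_bytes = {enc: name_ar.encode(enc) for enc in encodings}

for enc in encodings:
    print(enc, ascii(ascii_bytes[enc]), ascii(arabic_bytes[enc]))
```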
Most Arabic text is encoded in (ordered by popularity [1]) Windows-1256, UTF-8, CP720, or ISO 8859-6. You should know ahead of time which encoding your plain text uses; for example, most text editors let you select the encoding when you save the file.
I have three files containing your name, طارق, in three different encodings. Reading the files as raw binary data shows how different they are, even though the text is the same:
>>> f = open('file-utf8.txt', 'rb')
>>> f.read()
b'\xd8\xb7\xd8\xa7\xd8\xb1\xd9\x82'
>>>
>>> f = open('file-cp720.txt', 'rb')
>>> f.read()
b'\xe1\x9f\xa9\xe7'
>>>
>>> f = open('file-windows1256.txt', 'rb')
>>> f.read()
b'\xd8\xc7\xd1\xde'
>>>
The right way to read these files is to tell Python which encoding to use, so that it decodes the bytes into its internal Unicode representation (using the mapping tables in /Python33/Lib/encodings/):
>>> f = open('file-utf8.txt', encoding='utf-8')
>>> f.read()
'طارق'
>>>
>>> f = open('file-cp720.txt', encoding='cp720')
>>> f.read()
'طارق'
>>>
>>> f = open('file-windows1256.txt', encoding='windows-1256')
>>> f.read()
'طارق'
>>>
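For contrast, here is a small sketch of what happens when the wrong encoding is used on the same bytes. The byte strings are copied from the binary reads above; the variable names are only illustrative.

```python
utf8_bytes = b'\xd8\xb7\xd8\xa7\xd8\xb1\xd9\x82'   # contents of file-utf8.txt
cp1256_bytes = b'\xd8\xc7\xd1\xde'                   # contents of file-windows1256.txt

# Latin-1 accepts every byte, so the wrong codec "succeeds" and silently
# returns garbage instead of the Arabic name.
garbled = utf8_bytes.decode('latin-1')
print(ascii(garbled))

# UTF-8 is stricter: these Windows-1256 bytes are not valid UTF-8.
try:
    cp1256_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print("decode failed:", exc)

# With the right codec, both byte strings decode to the same text.
same_text = utf8_bytes.decode('utf-8') == cp1256_bytes.decode('windows-1256')
print(same_text)
```

The silent case is the more dangerous one, because nothing tells you the text is wrong.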
Encoding is not only an issue for files. Whenever you read text from a source external to the program (a file, the console, a network socket), you must know its encoding. Likewise, when you write to an external destination, you have to encode the text with the right encoding.
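The writing direction can be sketched the same way. This example writes to a temporary directory (the file name is made up) and reads the raw bytes back to confirm what actually ended up on disk.

```python
import os
import tempfile

name = "طارق"

# Write the text with an explicit encoding instead of the platform default.
path = os.path.join(tempfile.mkdtemp(), "out-windows1256.txt")
with open(path, "w", encoding="windows-1256") as f:
    f.write(name)

# Read the file back as raw bytes to see what was stored.
with open(path, "rb") as f:
    raw = f.read()
print(raw)

# For a socket or any other byte-oriented channel, encode explicitly
# with the encoding the receiving side expects (UTF-8 is assumed here).
payload = name.encode("utf-8")
print(payload)
```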
The encoding has to be consistent on both sides. If your console uses Latin-1 and you write Arabic text to it, e.g. with print, you will get meaningless characters or, if you are lucky, a UnicodeEncodeError exception.
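The failure can be reproduced without a Latin-1 console by encoding the string directly, which is roughly what print does with the console's encoding. A minimal sketch:

```python
import sys

name = "طارق"

# The encoding print() uses when writing to this console.
print(sys.stdout.encoding)

# Latin-1 has no Arabic letters, so a strict encode raises an exception.
try:
    name.encode("latin-1")
    error = None
except UnicodeEncodeError as exc:
    error = exc
print(type(error).__name__)

# errors="replace" avoids the exception but destroys the text:
# every letter becomes a question mark.
lossy = name.encode("latin-1", errors="replace")
print(lossy)
```

The exception is the "lucky" outcome: it points at the mismatch right away, while replacement characters or garbled output can go unnoticed.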
There are ways to guess the encoding, but I wouldn't bother with them, as they only mask the problem. It will surface sooner or later.
[1] If it's up to you, always go with UTF-8 because it's well supported.