Read an Arabic File in Python

Question

I'm working with an Arabic text file which is a corpus.

What should I do to import the file into Python so that I can access and analyze it easily, instead of copying and pasting its contents into the interpreter every time? It's an Arabic file, not English.

Thanks, I just did, but now how do I access the file content? I tried `text = file.readlines(); print text`, but the output was an empty list `[]`. I want to be able to call functions like `text.split()`, etc. What do you think the problem is? — Tarek Mostafa, Oct 26 '15 at 23:10
`read()` actually returns the file contents as a string. `text = file.read()` — Hassan, Oct 26 '15 at 23:11
BTW, I normally don't complain about this, but you should have researched this and tried something before asking here, especially with something so common. — Hassan, Oct 26 '15 at 23:13
http://www.pythonforbeginners.com/files/reading-and-writing-files-in-python — jgritty, Oct 26 '15 at 23:17
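One possible cause of the empty list mentioned in the comments (an assumption, since the original session is not shown) is that the file object had already been read to the end before `readlines()` was called. The sketch below reproduces that behaviour on a temporary file with made-up sample content:

```python
import os
import tempfile

# Demo setup: a small UTF-8 file with two lines of made-up sample text.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("سطر أول\nسطر ثان\n")

with open(path, encoding="utf-8") as f:
    text = f.read()            # the whole file as one string
    leftover = f.readlines()   # the position is already at the end of the file
    f.seek(0)                  # rewind to the start
    lines = f.readlines()      # now the lines are available again

print(leftover)    # []
print(len(lines))  # 2
```

Once `text` holds the string returned by `read()`, methods such as `text.split()` work on it directly.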

score 14 · Answer 1 · 2015-10-28T11:16:07.147

The most important thing when reading and writing plain text is to know and specify its encoding. You shouldn't let Python guess the encoding for you, especially in a real-world program, where the encoding should either be configurable or be requested from the user.
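As a rough illustration of a configurable encoding, the sketch below makes the encoding a required command-line option. The `read_text` helper, the `--encoding` option name, and the demo file are all hypothetical; they are not part of the original answer.

```python
import argparse
import os
import tempfile


def read_text(path, encoding):
    """Read a whole file, decoding it with the encoding the caller names."""
    with open(path, encoding=encoding) as f:
        return f.read()


# Hypothetical command-line interface: the encoding is a required option,
# so the program never falls back to a guess.
parser = argparse.ArgumentParser(description="Print the size of a text corpus.")
parser.add_argument("path")
parser.add_argument("--encoding", required=True)

# Demo setup: write a small Windows-1256 file so the sketch is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
with open(demo_path, "w", encoding="windows-1256") as f:
    f.write("طارق")

# In a real script this would be parser.parse_args() with no arguments.
args = parser.parse_args([demo_path, "--encoding", "windows-1256"])
text = read_text(args.path, args.encoding)
print(len(text))  # 4
```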

Many people never notice the issue with English text because ASCII is a subset of most encodings. The issue is still there, and they will run into it as soon as the program reads or writes text in a different encoding.

Most Arabic texts are encoded in (ordered by popularity¹) Windows-1256, UTF-8, CP720, or ISO 8859-6. You should know ahead of time which encoding your plain text uses; for example, most text editors let you select the encoding when you save the file.

I have three files containing your name, طارق, each in a different encoding. Reading the files as raw binary data shows how different they are, even though they hold the same text:

>>> f = open('file-utf8.txt', 'rb')
>>> f.read()
b'\xd8\xb7\xd8\xa7\xd8\xb1\xd9\x82'
>>>
>>> f = open('file-cp720.txt', 'rb')
>>> f.read()
b'\xe1\x9f\xa9\xe7'
>>>
>>> f = open('file-windows1256.txt', 'rb')
>>> f.read()
b'\xd8\xc7\xd1\xde'
>>>
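The same byte sequences can be reproduced without any files by encoding the string directly. This is only a sketch to make the comparison easy to rerun; the variable names are illustrative.

```python
name = "طارق"

# Encode the same four characters with each of the three encodings.
encoded = {enc: name.encode(enc) for enc in ("utf-8", "cp720", "windows-1256")}

for enc, raw in encoded.items():
    print(enc, raw)
# utf-8 b'\xd8\xb7\xd8\xa7\xd8\xb1\xd9\x82'
# cp720 b'\xe1\x9f\xa9\xe7'
# windows-1256 b'\xd8\xc7\xd1\xde'
```

UTF-8 needs two bytes per Arabic letter here, while the two single-byte encodings use one byte each but assign different byte values to the same letters.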

The right way to read these files is to tell Python which encoding to use, so that it decodes the bytes to its internal Unicode representation (using the mapping tables in /Python33/Lib/encodings/):

>>> f = open('file-utf8.txt', encoding='utf-8')
>>> f.read()
'طارق'
>>>
>>> f = open('file-cp720.txt', encoding='cp720')
>>> f.read()
'طارق'
>>>
>>> f = open('file-windows1256.txt', encoding='windows-1256')
>>> f.read()
'طارق'
>>>
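Once the file is decoded correctly, the result is an ordinary string that can be analyzed like any other. The sketch below assumes a UTF-8 corpus; the file name and the sample sentences are made up for the demo.

```python
import os
import tempfile
from collections import Counter

# Demo setup: a tiny UTF-8 "corpus" with made-up content.
path = os.path.join(tempfile.mkdtemp(), "corpus-utf8.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("طارق يقرأ\nطارق يكتب\n")

# The with statement closes the file even if reading fails.
with open(path, encoding="utf-8") as f:
    text = f.read()

words = text.split()      # split on any whitespace, including newlines
counts = Counter(words)   # word frequencies

print(len(words))         # 4
print(counts["طارق"])     # 2
```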

Encoding is not only an issue for files. Whenever your program reads text from an external source (a file, the console, a network socket), you must know the encoding. Likewise, when you write to an external destination, you have to encode the text with the right encoding.
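A minimal sketch of that round trip, using an in-memory `io.BytesIO` buffer as a stand-in for a socket or pipe (an assumption made to keep the example self-contained):

```python
import io

# Stand-in for an external channel such as a socket: it only carries bytes.
wire = io.BytesIO()

# Encode on the way out.
wire.write("طارق".encode("windows-1256"))

# Decode on the way in, using the same encoding the sender used.
wire.seek(0)
raw = wire.read()
text = raw.decode("windows-1256")

print(raw)  # b'\xd8\xc7\xd1\xde'
```

Both sides have to agree on the encoding out of band; the bytes themselves do not say which one was used.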

The encoding has to be consistent. If your console uses Latin-1 and you try to write Arabic text to it, e.g. with print, you will get meaningless characters or, if you are lucky, a UnicodeEncodeError exception.
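Both outcomes can be shown with plain strings and bytes, without involving a real console. This is a sketch of the two failure modes, not of how any particular terminal behaves.

```python
name = "طارق"

# Latin-1 has no Arabic letters, so encoding fails loudly.
try:
    name.encode("latin-1")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# Decoding UTF-8 bytes as Latin-1 never raises, because every byte value
# maps to some Latin-1 character; the result is just the wrong text.
garbled = name.encode("utf-8").decode("latin-1")

print(encode_failed, len(garbled))  # True 8
```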

There are ways to guess the encoding, but I wouldn't bother with them, as they only mask the problem; it will surface sooner or later.
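To illustrate how guessing can mask the problem, here is a naive guesser (the `guess_decode` helper is hypothetical) that returns the first candidate encoding that decodes without an error. Windows-1256 assigns a character to every byte value, so it accepts UTF-8 data without complaint and returns the wrong text.

```python
def guess_decode(data, candidates):
    """Return (encoding, text) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return encoding, data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


utf8_bytes = "طارق".encode("utf-8")

# The data really is UTF-8, but Windows-1256 is tried first and "succeeds".
guessed, text = guess_decode(utf8_bytes, ["windows-1256", "utf-8"])

print(guessed)            # windows-1256
print(text == "طارق")     # False
```

A decode that raises no exception therefore says little about whether the encoding was the right one.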

¹ If it's up to you, always go with UTF-8 because it's well supported.

(1) Note: if you don't pass an explicit character encoding, then `locale.getpreferredencoding(False)` is used (likely equivalent to `cp1256` on Windows). (2) A "meaningless word" is sometimes called "mojibake". (3) Some interfaces are Unicode, e.g. you could [print Unicode to Windows console directly regardless of `chcp` (install `win-unicode-console`)](http://stackoverflow.com/a/30551552/4279) — jfs, Oct 29 '15 at 13:59
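To see which default the comment refers to on a given machine, the value can be printed directly. The result depends on the operating system and locale settings, so no particular output is assumed here.

```python
import locale

# The encoding open() falls back to when no encoding argument is passed.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```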

score 1 · Answer 2 · Moniba · 2020-06-17T16:01:02.310

Arabic text files are often encoded as UTF-8 or UTF-16 (`utf_8` and `utf_16` in Python). If you don't know which one your file uses, try both and see which one decodes correctly. You can do this with the codecs module by setting the encoding argument.

import codecs
import sys

# Pass your file as a command-line argument.
# Try the "utf_16" encoding if this does not work.
for line in codecs.open(sys.argv[1], encoding="utf_8"):
    print(line.strip())
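The "try both" step can also be automated. The sketch below is one way to do it with the built-in `open`; the `read_utf8_or_utf16` helper and the demo file are hypothetical. It relies on strict decoding raising an error when the encoding does not fit, which works for UTF-8 versus UTF-16 with a byte order mark but is not a general detection method.

```python
import os
import tempfile


def read_utf8_or_utf16(path):
    """Try UTF-8 first, then UTF-16; return (encoding, text)."""
    for encoding in ("utf-8", "utf-16"):
        try:
            with open(path, encoding=encoding) as f:
                return encoding, f.read()
        except UnicodeError:
            continue  # this encoding does not fit, try the next one
    raise ValueError("file is neither valid UTF-8 nor valid UTF-16")


# Demo setup: a UTF-16 file (Python writes a byte order mark at the start).
path = os.path.join(tempfile.mkdtemp(), "corpus-utf16.txt")
with open(path, "w", encoding="utf-16") as f:
    f.write("طارق")

encoding, text = read_utf8_or_utf16(path)
print(encoding)  # utf-16
```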

answered Jun 08 '20 at 15:50 by Moniba · edited Jun 17 '20 at 16:01

While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value. – Nic3500 Jun 09 '20 at 04:57

score 0 · Answer 3 · answered Oct 26 '15 at 23:26

Arabic is generally represented in Unicode.

Generally, you can open the file with an explicit encoding so that each line is decoded to Unicode as it is read:

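import codecs

# codecs.open decodes each line to Unicode while reading.
with codecs.open('unicode.rst', encoding='utf-8') as f:
    for line in f:
        print(repr(line))  # print(...) with parentheses runs on Python 2 and 3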

For more information, refer to https://docs.python.org/2/howto/unicode.html#reading-and-writing-unicode-data

score 0 · Answer 4 · edited Apr 20 '20 at 11:33

with open("Infixes.txt", encoding="utf-16") as f:
    print(f.read())

This works for me on Windows 8.1 with a 64-bit processor.

edited Apr 20 '20 at 11:33 by David Buck · answered Apr 20 '20 at 11:17 by Burhan Atique

score -1 · Answer 5 · edited Jul 23 '18 at 05:01

Use this to read an Urdu file in Python:

with open("Infixes.txt", encoding="utf-8") as f:
    print(f.read())

edited Jul 23 '18 at 05:01 by petezurich · answered Jul 23 '18 at 04:52 by AJk _FLY

Welcome to SO. Please learn how to format your postings. Thanks! – petezurich Jul 23 '18 at 05:00
