11

I'm working with an Arabic text file which is a corpus.

How can I load the file in Python so that I can access and analyze it easily, instead of copying and pasting its content into the interpreter every time? The file is in Arabic, not English.

Casimir Crystal
Tarek Mostafa

5 Answers

14

The most important thing when reading and writing plain text is to know and specify its encoding. You shouldn't let Python guess the encoding for you, especially in a real-world program (the encoding should either be configurable, or you should ask the user for it).
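One way to make the encoding configurable is to accept it as a command-line option. The following is only a minimal sketch: the names `read_corpus`, `build_parser`, and `main`, the `--encoding` flag, and the UTF-8 default are assumptions made for illustration, not something from the question.

```python
import argparse


def read_corpus(path, encoding):
    """Return the whole file as a str, decoded with the given encoding."""
    # The encoding is always passed explicitly; nothing is left to the locale default.
    with open(path, encoding=encoding) as f:
        return f.read()


def build_parser():
    parser = argparse.ArgumentParser(description="Print a text file.")
    parser.add_argument("path", help="path to the text file")
    # Hypothetical flag: the user states the encoding instead of Python guessing it.
    parser.add_argument("--encoding", default="utf-8",
                        help="encoding of the file, e.g. utf-8 or windows-1256")
    return parser


def main(argv=None):
    # argv=None makes argparse fall back to sys.argv[1:].
    args = build_parser().parse_args(argv)
    print(read_corpus(args.path, args.encoding))
```

Calling `main(["corpus.txt", "--encoding", "windows-1256"])` would print a (hypothetical) `corpus.txt` decoded as Windows-1256.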

Many people never notice the issue with English text because ASCII is a subset of most encodings. The issue is still there, and they will run into it as soon as the program tries to read or write text in a different encoding.

Most Arabic texts are encoded in (ordered by popularity, see footnote 1) Windows-1256, UTF-8, CP720, or ISO 8859-6. You should know ahead of time which encoding your plain text uses; for example, most text editors let you select the encoding when you save the file.

I have three files containing your name, طارق, in three different encodings. Reading the files as raw binary data shows how different they are, even though the text is the same:

>>> f = open('file-utf8.txt', 'rb')
>>> f.read()
b'\xd8\xb7\xd8\xa7\xd8\xb1\xd9\x82'
>>>
>>> f = open('file-cp720.txt', 'rb')
>>> f.read()
b'\xe1\x9f\xa9\xe7'
>>>
>>> f = open('file-windows1256.txt', 'rb')
>>> f.read()
b'\xd8\xc7\xd1\xde'
>>>

The right way to read these files is to tell Python which encoding to use, so that it decodes the bytes into its internal Unicode representation (using the mapping tables in /Python33/Lib/encodings/):

>>> f = open('file-utf8.txt', encoding='utf-8')
>>> f.read()
'طارق'
>>>
>>> f = open('file-cp720.txt', encoding='cp720')
>>> f.read()
'طارق'
>>>
>>> f = open('file-windows1256.txt', encoding='windows-1256')
>>> f.read()
'طارق'
>>>
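The comparison above can be reproduced without preparing the files by hand. This is a small sketch, assuming temporary files and the illustrative helper name `round_trip`: it writes the same text in each encoding, then reads it back both as raw bytes and as decoded text.

```python
import os
import tempfile

TEXT = "طارق"
ENCODINGS = ["utf-8", "cp720", "windows-1256"]


def round_trip(text, encoding, directory):
    """Write text in the given encoding; return (raw bytes, decoded text)."""
    path = os.path.join(directory, "file-" + encoding + ".txt")
    with open(path, "w", encoding=encoding) as f:
        f.write(text)
    with open(path, "rb") as f:  # raw bytes, no decoding
        raw = f.read()
    with open(path, encoding=encoding) as f:  # decoded with the matching codec
        decoded = f.read()
    return raw, decoded


with tempfile.TemporaryDirectory() as tmp:
    results = {enc: round_trip(TEXT, enc, tmp) for enc in ENCODINGS}

for enc, (raw, decoded) in results.items():
    # ascii() keeps the output printable even on consoles that cannot show Arabic.
    print(enc, raw, ascii(decoded))
```

The raw bytes differ for each encoding, while the decoded text is identical as long as the codec used for reading matches the one used for writing.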

The issue of encoding is not limited to files. Whenever your program reads text from an external source (a file, the console, a network socket), you must know its encoding. Likewise, when you write to an external destination, you have to encode the text with the right encoding.

The encoding has to be consistent. If your console uses Latin-1 and you try to write Arabic text to it (i.e. with print), you will get meaningless characters or, if you are lucky, a UnicodeEncodeError exception.
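Both failure modes can be shown in a few lines. This is a minimal sketch that uses in-memory bytes instead of a real console; the variable names are illustrative only.

```python
raw = "طارق".encode("windows-1256")  # bytes as a Windows-1256 file would store them

# Wrong single-byte codec: decoding "succeeds" but yields meaningless characters.
garbled = raw.decode("latin-1")

# Wrong strict codec: these bytes are not valid UTF-8, so decoding fails loudly.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Latin-1 has no Arabic letters, so encoding the text for Latin-1 raises.
try:
    "طارق".encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

print(ascii(garbled), utf8_failed, latin1_failed)
```

The silent case is the dangerous one: `garbled` is a perfectly valid string, so nothing in the program signals that the text is wrong.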

There are ways to guess the encoding, but I wouldn't bother with them, as they only mask the problem. It will surface sooner or later.

1 If it's up to you, always go with UTF-8 because it's well supported.

  • 2
    (1) note: if you don't pass the character encoding explicitly then `locale.getpreferredencoding(False)` is used (likely, it is equivalent to `cp1256` on Windows) (2) a "meaningless word" is sometimes called "mojibake" (3) Some interfaces are Unicode e.g., you could [print Unicode to Windows console directly regardless of `chcp` (install `win-unicode-console`)](http://stackoverflow.com/a/30551552/4279) – jfs Oct 29 '15 at 13:59
  • 1
    in python2 use `codecs.open('file', encoding='utf-8')` – mohammedgqudah Dec 07 '17 at 18:46
1

Arabic text files are often encoded as utf_8 or utf_16. If you don't know which one your file uses, try both and see which one decodes it correctly. You can do this with the codecs package by setting the encoding argument.

import codecs
import sys

# Pass your file as a command-line argument.
# Try the "utf_16" encoding if "utf_8" does not work.
for line in codecs.open(sys.argv[1], encoding="utf_8"):
    print(line.strip())
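The "try both" approach can also be sketched as a loop that stops at the first candidate that decodes cleanly. The helper name `read_with_fallback` and its default candidate list are assumptions for illustration. Note that a successful decode does not prove the encoding is right, which is why the strict UTF-8 codec is tried first.

```python
def read_with_fallback(path, encodings=("utf-8", "utf-16")):
    """Return (text, encoding) for the first candidate that decodes the file."""
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in encodings:
        try:
            # errors="strict" is the default, so invalid bytes raise an exception.
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode " + path)
```

A caller gets back the encoding that worked, so it can be recorded and passed explicitly the next time.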
Moniba
  • 2
    While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value. – Nic3500 Jun 09 '20 at 04:57
0

Arabic is generally represented in Unicode.

You can decode the file to Unicode as you read it:

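import codecs

# Decode each line from UTF-8 to Unicode as it is read.
with codecs.open('unicode.rst', encoding='utf-8') as f:
    for line in f:
        print(repr(line))  # this form works in both Python 2 and Python 3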

For more information, refer to https://docs.python.org/2/howto/unicode.html#reading-and-writing-unicode-data

stanleyli
0
with open("Infixes.txt", encoding="utf-16") as f:
    print(f.read())

This works for me on Windows 8.1 with a 64-bit processor.

David Buck
-1

Use this for Urdu file reading in Python:

with open("Infixes.txt", encoding="utf-8") as f:
    print(f.read())
petezurich
AJk _FLY