Read only Arabic text from file in Python

Asked Jun 10 '20 at 17:04

Active Jun 10 '20 at 17:12

I have a text file that includes both English and Arabic words.

How can I extract only the Arabic text into a string?

Sample file:

<head>
ابتداء القول الأول
</head>
<p>
وأيضاً بعض الحيوان البحرى لجى وبعضه شاطئى، وبعضه صخرى.
</p>

And the desired output is:

only_arabic = "ابتداء القول الأول وأيضاً بعض الحيوان البحرى لجى وبعضه شاطئى، وبعضه صخرى"

I couldn't find a regex that handles this. What am I missing?

edited Jun 10 '20 at 17:05

pppery

asked Jun 10 '20 at 17:04

Yanirmr

[Don't use regex to parse html](https://stackoverflow.com/a/1732454/3750257) – pppery Jun 10 '20 at 17:06
If the English text is just HTML tags, you can strip the tags with a regex such as `s/<[^>]*>//g` (a sed command on Linux) – nourhero Jun 10 '20 at 17:12
https://stackoverflow.com/questions/29406247/how-to-remove-english-text-from-arabic-string-in-python – Yanirmr Jun 10 '20 at 17:20
You can use the `unicodedata` module (part of the standard library). See https://docs.python.org/3/library/unicodedata.html and its last example: iterate over the code points and keep the Arabic ones (and possibly the punctuation, spaces, digits, and other symbols that occur within Arabic text). – Giacomo Catenazzi Jun 12 '20 at 08:03
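
The `unicodedata` suggestion in the last comment could be sketched roughly as follows. This is only a minimal illustration, not a tested answer from the thread: the function name `only_arabic` is made up here, and it assumes that "Arabic" means any character whose Unicode name starts with `ARABIC` (letters, diacritics, the Arabic comma, Arabic-Indic digits). Everything else, including the tags and ASCII punctuation, is turned into a space and the whitespace is then collapsed.

```python
import unicodedata


def only_arabic(text: str) -> str:
    """Keep characters whose Unicode name starts with "ARABIC"; drop the rest."""
    kept = []
    for ch in text:
        # unicodedata.name() returns the given default ("") for characters
        # without a name, such as newlines, so those count as non-Arabic.
        if unicodedata.name(ch, "").startswith("ARABIC"):
            kept.append(ch)
        else:
            # Tags, Latin letters, ASCII punctuation and whitespace become a
            # space so that neighbouring Arabic words stay separated.
            kept.append(" ")
    # Collapse runs of whitespace and trim both ends.
    return " ".join("".join(kept).split())


# The sample file from the question, inlined as a string for illustration.
sample = (
    "<head>\n"
    "ابتداء القول الأول\n"
    "</head>\n"
    "<p>\n"
    "وأيضاً بعض الحيوان البحرى لجى وبعضه شاطئى، وبعضه صخرى.\n"
    "</p>"
)
print(only_arabic(sample))
```

On the sample from the question this drops the tags and the trailing ASCII period while keeping the Arabic comma. To work from an actual file, the text could be read first with something like `open(path, encoding="utf-8").read()`, where `path` is whatever file you have. If the file were real HTML that also contained Arabic inside attributes or scripts, an HTML parser would be a safer first step, as the first comment points out.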
