
I have a text file that includes both English and Arabic words.

How can I extract only the Arabic text into a string?

Sample file:

<head>
ابتداء القول الأول
</head>
<p>
وأيضاً بعض الحيوان البحرى لجى وبعضه شاطئى، وبعضه صخرى.
</p>

And the desired output is:

only_arabic = "ابتداء القول الأول وأيضاً بعض الحيوان البحرى لجى وبعضه شاطئى، وبعضه صخرى"

I couldn't find a regex that handles this. What am I missing?

  • [Don't use regex to parse html](https://stackoverflow.com/a/1732454/3750257) – pppery Jun 10 '20 at 17:06
  • If the English text is just HTML tags, you can strip the tags with a regex such as 's/<[^>]*>//g' (a sed command on Linux). – nourhero Jun 10 '20 at 17:12
  • https://stackoverflow.com/questions/29406247/how-to-remove-english-text-from-arabic-string-in-python – Yanirmr Jun 10 '20 at 17:20
  • You can use the `unicodedata` module (it is in the standard library). See https://docs.python.org/3/library/unicodedata.html, in particular the last example: iterate over the code points and keep the Arabic ones (and possibly the punctuation, spaces, numbers, and other symbols that occur within Arabic text). – Giacomo Catenazzi Jun 12 '20 at 08:03
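
The suggestions in the comments can be combined into a minimal sketch. It assumes the file is HTML-like markup, drops the tags with the standard-library `html.parser` rather than a regex, and then keeps only the characters whose Unicode name starts with `ARABIC`. The names `TextCollector`, `is_arabic`, and `extract_arabic` are hypothetical helpers made up for this illustration, not part of any library.

```python
import unicodedata
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def is_arabic(ch):
    # Assumption: "Arabic" means the Unicode character name starts with
    # "ARABIC" (letters, diacritics, the Arabic comma, Arabic-Indic digits).
    # The "" default avoids a ValueError for unnamed code points.
    return unicodedata.name(ch, "").startswith("ARABIC")


def extract_arabic(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    text = " ".join(parser.chunks)
    # Replace every non-Arabic character with a space so words stay separated,
    # then collapse runs of whitespace into single spaces.
    kept = "".join(ch if is_arabic(ch) else " " for ch in text)
    return " ".join(kept.split())


sample = """<head>
ابتداء القول الأول
</head>
<p>
وأيضاً بعض الحيوان البحرى لجى وبعضه شاطئى، وبعضه صخرى.
</p>"""

only_arabic = extract_arabic(sample)
print(only_arabic)
```

Under this definition the ASCII full stop at the end of the paragraph is dropped, while the Arabic comma is kept because its Unicode name is `ARABIC COMMA`. If Latin punctuation or digits inside Arabic text should survive, `is_arabic` is the place to widen the test.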

0 Answers