0

I have this code and I want to remove the non-alphanumeric characters. The problem is that it removes the Arabic words as well. How can I keep the Arabic characters and remove only the non-alphanumeric characters?

# -*- coding: utf-8 -*-
import re
hello = u"سلام .@#(*&"
print re.sub(r'\W+', '', hello)

It outputs an empty string.

But I want this:

"سلام"
Mohammad Yusuf
Charif DZ

3 Answers

2

This happens because Arabic characters are not "word" characters in the traditional sense: unless Unicode matching is turned on, \w only covers ASCII.

see here

the relevant text:

"\w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]"

...

"The above three shorthands also have negated versions. \D is the same as [^\d], \W is short for [^\w] and \S is the equivalent of [^\s]."

# -*- coding: utf-8 -*-
import re
hello = u"سلام .@#(*&"
print re.sub(ur'[^\w\u0600-\u06FF]', '', hello)
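
To see which characters the shorthands from the quoted text actually select, here is a minimal sketch. It assumes Python 3, and it uses the re.ASCII flag as a stand-in for the ASCII-only \w described above; the sample string is made up for illustration.

```python
import re

# Made-up sample: ASCII word characters, a space, then four Arabic letters.
sample = "abc_123 سلام"

# With re.ASCII (an assumption used here to mimic ASCII-only matching),
# \w only matches [A-Za-z0-9_].
ascii_words = re.findall(r'\w', sample, flags=re.ASCII)

# \W and [^\w] are two spellings of the same negated class.
negated_shorthand = re.findall(r'\W', sample, flags=re.ASCII)
negated_class = re.findall(r'[^\w]', sample, flags=re.ASCII)

print(''.join(ascii_words))                # abc_123
print(negated_shorthand == negated_class)  # True
print(len(negated_shorthand))              # 5 (the space plus the four Arabic letters)
```

So in ASCII mode the Arabic letters fall on the \W side, which is why the substitution in the question wipes them out.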
deweyredman
2

Edit: I realized there is a simpler answer. Just turn unicode mode on.

re.sub(r'\W', '', hello, flags=re.UNICODE)

In Python 3 this flag is unnecessary because str patterns are matched with Unicode rules by default. See https://stackoverflow.com/a/393915/691859 for more information.
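
As a rough illustration of the difference the flag makes, here is a small sketch that assumes Python 3. Unicode matching is already the default for str patterns there, so re.ASCII is used to reproduce the behaviour from the question.

```python
import re

hello = "سلام .@#(*&"

# Python 3: \W is Unicode-aware for str patterns, so Arabic letters count
# as word characters and survive the substitution.
cleaned = re.sub(r'\W+', '', hello)

# re.ASCII narrows \w to [A-Za-z0-9_], which reproduces the empty result
# described in the question.
ascii_only = re.sub(r'\W+', '', hello, flags=re.ASCII)

print(ascii(cleaned))     # '\u0633\u0644\u0627\u0645'
print(ascii(ascii_only))  # ''
```

Keep in mind that \w also keeps the underscore and digits from any script, so this is slightly broader than "letters only".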


(Old answer)

You need to define the character class that you actually want to keep. Since you're dealing with Unicode characters, you will want to construct a character class that includes your characters. I'm no Unicode expert and I also can't read Arabic, but let's go with what Wikipedia says is the Arabic Unicode block, which is U+0600 to U+06FF.

>>> re.sub(ur'[^\u0600-\u06FF]', '', hello)
u'\u0633\u0644\u0627\u0645'

The secret sauce is to make your regex itself also a unicode string, so you can put in the unicode escape sequences for the Arabic unicode block.

As others pointed out, \W means [^\w], which (without Unicode mode) also covers the Arabic block. If you want to remove everything but Arabic and Latin alphanumeric characters, you can use [^\w\u0600-\u06FF].

  • [] means character class.
  • ^ at the start of the class means everything but what you're about to put in the class.
  • \w means A-Z, a-z, _, and 0-9 (when Unicode mode is off).
  • \u0600 is the unicode escape for the first character in the Arabic unicode block.
  • - means "everything from the character on its left to the character on its right".
  • \u06FF is the unicode escape for the last character in the Arabic unicode block.
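
Putting those pieces together, a minimal sketch might look like the following. It assumes Python 3 with re.ASCII, so that \w really is limited to A-Z, a-z, _, and 0-9 as in the list above. The helper name strip_symbols and the sample strings are made up for illustration.

```python
import re

# Negated class: match anything that is neither an ASCII word character
# nor inside the Arabic block U+0600..U+06FF.
NOT_LATIN_OR_ARABIC = re.compile(r'[^\w\u0600-\u06FF]+', flags=re.ASCII)


def strip_symbols(text):
    """Hypothetical helper: drop everything outside the kept ranges."""
    return NOT_LATIN_OR_ARABIC.sub('', text)


print(ascii(strip_symbols("سلام abc123 .@#(*&")))
# '\u0633\u0644\u0627\u0645abc123'

# Pitfall: only the first ^ negates the class. A second ^ is a literal
# caret, so carets are silently kept as well.
with_stray_caret = re.sub(r'[^\w^\u0600-\u06FF]+', '', "a^b .@#", flags=re.ASCII)
print(with_stray_caret)  # a^b
```

Spaces are removed too, since they are in neither range. Add \s to the class if you want to keep them.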
2rs2ts
  • I think the range is \u0600 to \u06FF, no? – deweyredman Jan 06 '17 at 19:48
  • @deweyredman good point, I linked to the wrong article. I was looking at one that said the range for just the basic Arabic characters was U-0600 to U-0650, but chose the link that had the full table. I'll edit my "old" answer so that it isn't wrong. – 2rs2ts Jan 06 '17 at 20:22
2

I had the same problem until I found this jQuery solution:

function slugify(text)
{
  return text.toString().toLowerCase()
    .replace(/[^\w\u0600-\u06FF]+/g, '-')  // Replace each run of characters that are neither word chars nor Arabic with "-"
    .replace(/\-\-+/g, '-')         // Replace multiple - with single -
    .replace(/^-+/, '')             // Trim - from start of text
    .replace(/-+$/, '');            // Trim - from end of text
}

I wanted to make a slug generator that respects Arabic characters. The idea is to include the Arabic characters in the regular expression, so this is the final result. Hope it helps:

// slug creation
$(document).ready(function () {
  $("#name").change(function () {
    // Read the post title and write the generated slug into the #slug field
    var postTitle = document.getElementById("name").value;
    document.getElementById("slug").value = slugify(postTitle);
  });
});
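
Since the question is about Python, here is a rough Python sketch of the same slug idea. It assumes Python 3 and uses re.ASCII so that \w behaves like JavaScript's ASCII-only \w. This slugify is an illustrative port, not a library function.

```python
import re


def slugify(text):
    """Illustrative port of the slug idea: keep ASCII word characters and
    the Arabic block U+0600..U+06FF, and turn everything else into '-'."""
    slug = text.lower()
    # Each run of characters outside the kept ranges becomes a single dash.
    slug = re.sub(r'[^\w\u0600-\u06FF]+', '-', slug, flags=re.ASCII)
    # Collapse repeated dashes (kept to mirror the original steps).
    slug = re.sub(r'-{2,}', '-', slug)
    # Trim dashes from both ends.
    return slug.strip('-')


print(slugify("  Hello, World!  "))  # hello-world
print(ascii(slugify("سلام .@#(*& Python 3")))
# '\u0633\u0644\u0627\u0645-python-3'
```

As in the JavaScript version, underscores are kept because they are part of \w.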


Ahmed Osama