Remove non-alphanumeric characters by regex substitution

Question

I have this code and I want to remove the non-alphanumeric characters. The problem is that it removes the Arabic words as well. How can I keep the Arabic characters and remove only the non-alphanumeric characters?

# -*- coding: utf-8 -*-
import re
hello = u"سلام .@#(*&"
print re.sub(r'\W+', '', hello)

It outputs an empty string.

But I want this:

"سلام"

Note that `\W+` and `\w+` are very different. `\w` is any word character. It is equivalent to `[A-Za-z0-9_]` while `\W` is any non-word character, equivalent to `[^A-Za-z0-9_]`. The `+` means "one or more" — Patrick Haugh, Jan 06 '17 at 18:59
Possible duplicate of [Reference - What does this regex mean?](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) — Patrick Haugh, Jan 06 '17 at 19:00
Edit your question, it's not clear at all. Provide test string, the regex you tried and the final string that you want. — Mohammad Yusuf, Jan 06 '17 at 19:08
It's exactly what I need, thank you so much. Do you have any idea?! — Charif DZ, Jan 06 '17 at 19:19
In Python3.4, `re.sub(r'\W+', '', hello)` returns `سلام`. — unutbu, Jan 06 '17 at 19:30

deweyredman · Answer 1 · 2017-01-06T20:09:05.500

2

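This happens because, in Python 2 without the re.UNICODE flag, \w matches only ASCII word characters. Arabic letters therefore do not count as "word" characters, and \W removes them.

The relevant text from the regex reference: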

"\w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]"

...

"The above three shorthands also have negated versions. \D is the same as [^\d], \W is short for [^\w] and \S is the equivalent of [^\s]."

# -*- coding: utf-8 -*-
import re
hello = u"سلام .@#(*&"
# Python 2: remove everything except ASCII word characters and the Arabic block (U+0600 to U+06FF)
print re.sub(ur'[^\w\u0600-\u06FF]', '', hello)
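One detail of character-class syntax is worth checking in patterns like this: only a caret placed directly after `[` negates the class, while a caret anywhere else is a literal member of the class. The following is a minimal Python 3 sketch of that difference, not part of the original answer. The `sample` string and variable names are made up for illustration, and `re.ASCII` is used on the assumption that it reproduces the ASCII-only `\w` of Python 2.

```python
import re

# Hypothetical sample containing a literal caret, Arabic letters, and punctuation.
sample = "a^b \u0633\u0644\u0627\u0645!"

# Only the caret right after "[" negates the class, so "^", " " and "!" are removed.
negated_once = re.sub(r"[^\w\u0600-\u06FF]", "", sample, flags=re.ASCII)

# A second caret is a literal member of the class, so "^" is kept.
stray_caret = re.sub(r"[^\w^\u0600-\u06FF]", "", sample, flags=re.ASCII)

print(repr(negated_once))
print(repr(stray_caret))
```

If literal carets should be stripped along with other punctuation, the single-caret form is the one to use.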

edited Jan 06 '17 at 20:09

answered Jan 06 '17 at 18:59

deweyredman


Thank you for your answer. My problem is that I want to remove the non-alphanumeric chars but keep the Arabic word! – Charif DZ Jan 06 '17 at 19:04
Do you want to keep all Arabic characters? If so, you need to find the first and last characters you want to cover and create a range. – deweyredman Jan 06 '17 at 19:31
Yes, but I'm not that good at Python; I'm coming from Java. How can I define the range? – Charif DZ Jan 06 '17 at 19:36
I've updated my answer to reflect my previous comment. – deweyredman Jan 06 '17 at 19:37

score 2 · Accepted Answer · edited May 23 '17 at 12:01

Edit: I realized there is a simpler answer. Just turn unicode mode on.

re.sub(r'\W', '', hello, flags=re.UNICODE)

In Python 3 this flag is unnecessary because of how Python 3 handles unicode strings. See https://stackoverflow.com/a/393915/691859 for more information.
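As a rough illustration of the point about Python 3, here is a minimal sketch, assuming Python 3 and `str` patterns. It shows `\W` leaving the Arabic letters alone by default, and `re.ASCII` bringing back the ASCII-only behaviour seen in the question. The variable names are illustrative only.

```python
import re

hello = "\u0633\u0644\u0627\u0645 .@#(*&"  # the question's string: "سلام .@#(*&"

# Python 3 str patterns are Unicode-aware by default, so \w covers Arabic letters.
unicode_aware = re.sub(r"\W", "", hello)

# re.ASCII limits \w to [A-Za-z0-9_], so \W now matches the Arabic letters too.
ascii_only = re.sub(r"\W", "", hello, flags=re.ASCII)

print(repr(unicode_aware))
print(repr(ascii_only))
```

With Unicode matching, `\w` accepts letters and digits from any script, not only Arabic and Latin. If that is too permissive, an explicit range is the stricter option.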

(Old answer)

You need to define the character class that you actually want to keep. Since you're dealing with Unicode characters, you will want to construct a character class that includes your characters. I'm no Unicode expert and I also can't read Arabic, but let's go with what Wikipedia says is the Arabic Unicode block: U+0600 to U+06FF.

>>> re.sub(ur'[^\u0600-\u06FF]', '', hello)
u'\u0633\u0644\u0627\u0645'

The secret sauce is to make your regex itself also a unicode string, so you can put in the unicode escape sequences for the Arabic unicode block.

As others pointed out, \W means [^\w], which (without the Unicode flag) also matches everything in the Arabic block. To match everything except Arabic and Latin alphanumeric characters, so that the substitution removes it, use [^\w\u0600-\u06FF].

[] means character class.
^ means everything but what you're about to put in the class.
\w means A-Z, a-z, _, and 0-9.
\u0600 is the unicode escape for the first character in the Arabic unicode block.
- means "everything from the character on its left to the character on its right".
\u06FF is the unicode escape for the last character in the Arabic unicode block.
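The pieces listed above can be put together as in the following sketch. It assumes Python 3, where `re.ASCII` is needed to pin `\w` to `A-Z`, `a-z`, `_` and `0-9` as described in the list. The names `KEEP_ASCII_AND_ARABIC` and `strip_other_chars` are hypothetical, chosen for this example.

```python
import re

# [^...] negates the class; \w is ASCII-only here because of re.ASCII;
# \u0600-\u06FF is the Arabic block. Anything outside both sets is matched.
KEEP_ASCII_AND_ARABIC = re.compile(r"[^\w\u0600-\u06FF]", flags=re.ASCII)


def strip_other_chars(text):
    """Remove every character that is neither an ASCII word character nor in the Arabic block."""
    return KEEP_ASCII_AND_ARABIC.sub("", text)


print(repr(strip_other_chars("\u0633\u0644\u0627\u0645 .@#(*& abc_123")))
```

Note that the whole block is kept, so Arabic punctuation such as U+060C (the Arabic comma) survives as well. A narrower range would be needed to drop it.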

@deweyredman good point, I linked to the wrong article. I was looking at one that said the range for just the basic Arabic characters was U-0600 to U-0650, but chose the link that had the full table. I'll edit my "old" answer so that it isn't wrong. — 2rs2ts, Jan 06 '17 at 20:22

score 2 · Answer 3 · answered Apr 26 '19 at 17:31

I had the same problem until I found this jQuery solution:

function slugify(text)
{
  return text.toString().toLowerCase()
    .replace(/[^\w\u0600-\u06FF]+/g, '-')  // Replace each run of characters that are neither word characters nor Arabic with "-"
    .replace(/\-\-+/g, '-')         // Replace multiple - with single -
    .replace(/^-+/, '')             // Trim - from start of text
    .replace(/-+$/, '');            // Trim - from end of text
}

I wanted to make a slug generator that respects Arabic characters. The idea is to include the Arabic characters in the regular expression's character class. This is the final result; I hope it helps:

// Slug creation: fill #slug whenever the value of #name changes
$(document).ready(function(){
  $("#name").change(function(){
    var postTitle = document.getElementById("name").value;
    var slugTitle = slugify(postTitle);
    document.getElementById("slug").value = slugTitle;
  });
});


