Edit: I realized there is a simpler answer. Just turn unicode mode on.
re.sub(r'\W', '', hello, flags=re.UNICODE)
In Python 3 this flag is unnecessary because of how Python 3 handles unicode strings. See https://stackoverflow.com/a/393915/691859 for more information.
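For reference, here is a minimal Python 3 sketch of the same idea. The value of hello below is a made-up sample (the original input isn't shown in this answer), so treat it as an assumption.

```python
import re

# Hypothetical sample input: an Arabic word followed by punctuation and a space.
hello = "\u0633\u0644\u0627\u0645!? "

# In Python 3, str patterns are Unicode-aware by default, so \W does not
# match Arabic letters; only the punctuation and whitespace are removed.
cleaned = re.sub(r"\W", "", hello)

# Passing re.ASCII restricts \w to [a-zA-Z0-9_], so here the Arabic
# letters count as non-word characters and are stripped too.
ascii_cleaned = re.sub(r"\W", "", hello, flags=re.ASCII)

print(cleaned == "\u0633\u0644\u0627\u0645")
print(ascii_cleaned == "")
```

Both comparisons should print True, which shows why the flag only matters when the pattern is not already in Unicode mode.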
(Old answer)
You need to define the character class that you actually want to keep. Since you're dealing with Unicode characters, you will want to construct a character class that includes your characters. I'm no Unicode expert and I also can't read Arabic, but let's go with what Wikipedia says is the Arabic Unicode block, which is U+0600 to U+06FF.
>>> re.sub(ur'[^\u0600-\u06FF]', '', hello)
u'\u0633\u0644\u0627\u0645'
The secret sauce is to make your regex itself also a unicode string, so you can put in the unicode escape sequences for the Arabic unicode block.
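The ur'' prefix only exists in Python 2. As a rough Python 3 equivalent (a sketch, with a made-up sample string standing in for hello), a plain raw string works because the re module understands \uXXXX escapes in str patterns:

```python
import re

# Hypothetical sample input mixing Latin text, digits, and an Arabic word.
hello = "abc \u0633\u0644\u0627\u0645 123!"

# Remove every character outside the Arabic block U+0600..U+06FF.
# The re module itself interprets the \u escapes in this raw str pattern.
arabic_only = re.sub(r"[^\u0600-\u06FF]", "", hello)

print(arabic_only == "\u0633\u0644\u0627\u0645")
```

Note that spaces are outside the block as well, so they get removed along with the Latin letters and digits.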
As others pointed out, \W means [^\w], which (without unicode mode) also matches the Arabic block. If you want to match everything but Arabic and Latin alphanumeric characters, you can use [^\w\u0600-\u06FF].
[] means character class.
^ means everything but what you're about to put in the class.
\w means A-Z, a-z, _, and 0-9 (when unicode mode is off).
\u0600 is the unicode escape for the first character in the Arabic unicode block.
- means "everything from the character on its left to the character on its right".
\u06FF is the unicode escape for the last character in the Arabic unicode block.
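Putting the pieces together, here is a small Python 3 sketch of the combined class. The sample string is an assumption, and re.ASCII is used only to imitate the narrow, non-unicode meaning of \w described above.

```python
import re

# Hypothetical sample: Latin word characters, an Arabic word, punctuation,
# and an accented letter that is neither ASCII nor in the Arabic block.
sample = "Hi_42, \u0633\u0644\u0627\u0645! \u00e9"

# With re.ASCII, \w is limited to [a-zA-Z0-9_]; the explicit range
# \u0600-\u06FF adds the Arabic block to the set of characters to keep.
pattern = re.compile(r"[^\w\u0600-\u06FF]", flags=re.ASCII)
kept = pattern.sub("", sample)

print(kept == "Hi_42\u0633\u0644\u0627\u0645")
```

The comma, spaces, exclamation mark, and the accented letter fall outside both parts of the class, so they are the only characters removed.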