Combining regular expressions in Python - \W and \S

Question

I want my code to return only the special characters [".", "*", "=", ","]. I want to remove all digits/alphabetical characters ("\W") and all white spaces ("\S").

import re

original_string = "John is happy. He owns 3*4=12, apples"
new_string = re.findall("\W\S",original_string)
print(new_string)

But instead I get this as my output: [' i', ' h', ' H', ' o', ' 3', '*4', '=1', ' a']

I have absolutely no idea why this happens. Hence I have two questions:

1) Is it possible to achieve my goal using regular expressions?

2) What is actually going on with my code?

@Toto I am all for closing questions as duplicates but I am a little doubtful as to the usefulness of such a broad post as a target in this case. If you can find a more specific post, that would be great. — cs95, May 19 '19 at 17:05
@cs95: It explains what is the difference between `\w` & `\W` and `\s` & `\S` in the 3rd paragraph: "Character Classes" — Toto, May 19 '19 at 17:09
@cs95: I understand your point of view, just reopen the question. — Toto, May 19 '19 at 18:27

score 3 · Accepted Answer · answered May 19 '19 at 16:47

You were close, but you need the lowercase escape sequences \w and \s inside a negated character class.

re.findall(r'[^\w\s]', original_string)
# ['.', '*', '=', ',']

Note that the caret ^ at the start of a character class indicates negation (i.e., don't match these characters).

Alternatively, instead of removing what you don't need, why not extract what you do?

re.findall(r'[.*=,]', original_string)
# ['.', '*', '=', ',']

cs95

What's with the negation? `[\W\S]` is a lot more obvious given the OP's requirements. – tripleee May 19 '19 at 16:54
@tripleee `re.findall(r'[\W\S]', original_string)` didn't work when I tried it on python3.6 and I just assumed it was incorrect. Maybe there's a trick to it... – cs95 May 19 '19 at 16:55
@tripleee Here's my take: `[^\w\s]` is the same as NOT (whitespace or alnum) versus `[\W\S]` which is (NOT whitespace or NOT alnum) which, if you know De Morgan's laws, isn't the same thing... could be wrong. – cs95 May 19 '19 at 16:58
The first expression should have "and" instead of "or", which is in fact equivalent. – tripleee May 19 '19 at 17:03
@tripleee Hmm, I've believed character classes always imply OR, or am I mistaken? Either way, you'll see `[\W\S]` does not work out of the box. – cs95 May 19 '19 at 17:04
I was curious too why this didn't work but it clearly doesn't! – EML May 19 '19 at 17:06
The caret negates all the characters in the class at the same time; so it's really "not (whitespace or alnum)", i.e. "(not whitespace) and (not alnum)". – tripleee May 19 '19 at 17:07
@tripleee Great! So, we agree with each other? The second case I mentioned would mean the classes are negated separately, so you have NOT x OR NOT y which is different from NOT x AND NOT y in the first case :p – cs95 May 19 '19 at 17:08
@EML This is likely an obvious detail we are discussing, but I have been unable to find any documentation specifically discussing the difference. – cs95 May 19 '19 at 17:09
... I stand corrected, the two are not equivalent. Sorry for being dense. – tripleee May 19 '19 at 17:16
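The distinction worked out in these comments can be checked directly. The following is a minimal sketch using the string from the question; the variable names are only illustrative.

```python
import re

original_string = "John is happy. He owns 3*4=12, apples"

# [^\w\s]: one character that is NOT (word OR whitespace),
# which is the same as (not word) AND (not whitespace).
negated_union = re.findall(r"[^\w\s]", original_string)

# [\W\S]: one character that is (not word) OR (not whitespace).
# No character is both a word character and whitespace,
# so every character in the string satisfies this class.
union_of_negations = re.findall(r"[\W\S]", original_string)

print(negated_union)                                    # ['.', '*', '=', ',']
print(len(union_of_negations) == len(original_string))  # True
```

In other words, `[\W\S]` behaves like "match anything", which is why it appears not to work.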

Emma · Answer 2 · 2019-05-19T23:14:30.060

Here, we can also put our desired special characters in a character class [], consume everything else, and then keep only those characters:

([\s\S].*?)([.*=,])?

Python Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"([\s\S].*?)([.*=,])?"

test_str = "John is happy. He owns 3*4=12, apples"

subst = "\\2"

# You can limit the number of replacements by changing count (0 means replace all).
# count and flags are passed by keyword so their meaning is explicit.
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

if result:
    print(result)

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
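As a sanity check on the substitution above, here is a small sketch that runs it on the question's string and on one edge case. The helper name `keep_specials` and the second test string are hypothetical, and the sketch assumes Python 3.5 or later, where an unmatched group in the replacement is treated as an empty string.

```python
import re

regex = r"([\s\S].*?)([.*=,])?"

def keep_specials(text):
    # Hypothetical helper wrapping the substitution shown above.
    return re.sub(regex, r"\2", text)

print(keep_specials("John is happy. He owns 3*4=12, apples"))  # .*=,

# A special character at the very start is consumed by group 1 and dropped.
print(keep_specials("*a=b"))                   # =
print("".join(re.findall(r"[.*=,]", "*a=b")))  # *=
```

The same thing happens to the second of two adjacent special characters, so this pattern relies on each wanted character being preceded by something else.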

JavaScript Demo

const regex = /([\s\S].*?)([.*=,])?/gm;
const str = `John is happy. He owns 3*4=12, apples`;
const subst = `$2`;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: ', result);

RegEx

If this wasn't our desired expression, we can modify/change it in regex101.com.

RegEx Circuit

We can also visualize expressions at jex.im.

Why are you digressing into JavaScript? — tripleee, May 19 '19 at 17:02

tripleee · Answer 3 · 2019-05-19T17:21:24.210

The regular expression \W\S matches a sequence of two characters; one non-word, and one non-space. If you want to combine them, that's [^\w\s] which matches one character which does not belong to either the word or the whitespace group.
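To see why the original pattern returns pairs, a short sketch with the string from the question (variable names are illustrative):

```python
import re

original_string = "John is happy. He owns 3*4=12, apples"

# \W\S consumes two characters per match: a non-word character
# followed by a non-whitespace character.
pairs = re.findall(r"\W\S", original_string)
print(pairs)  # [' i', ' h', ' H', ' o', ' 3', '*4', '=1', ' a']

# '.' and ',' are each followed by a space, so \S fails right after them
# and they never show up in the pairs above.
singles = re.findall(r"[^\w\s]", original_string)
print(singles)  # ['.', '*', '=', ',']
```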

However, there are many characters which are not one of the ones you enumerate which match this expression. If you want to remove characters which are not in your set, the character class containing exactly all those characters is simply [^.*=,]
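A minimal sketch of the difference, assuming the goal is to keep only the four enumerated characters; the `sample` string is made up for illustration.

```python
import re

original_string = "John is happy. He owns 3*4=12, apples"

# [^.*=,] matches any single character that is not one of . * = ,
# Replacing each such character with '' leaves only the wanted ones.
kept = re.sub(r"[^.*=,]", "", original_string)
print(kept)  # .*=,

# [^\w\s] is broader: it also matches punctuation outside the enumerated set.
sample = "a+b=c; d!"
print(re.findall(r"[^\w\s]", sample))  # ['+', '=', ';', '!']
print(re.findall(r"[.*=,]", sample))   # ['=']
```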

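Perhaps it's worth noting that inside [...] you don't need to (and in fact should not) backslash-escape e.g. the literal dot; there it matches only a dot. Outside a character class, an unescaped . matches any character except a newline by default, though the re.DOTALL flag changes this. A negated character class such as [^.*=,], by contrast, does match a newline.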
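A short sketch of how the dot behaves inside and outside a character class; the `text` value is an arbitrary example.

```python
import re

text = "a.b\nc"

# Inside [...] the dot is literal, so no backslash is needed.
print(re.findall(r"[.]", text))  # ['.']

# Outside a class, an unescaped dot is a wildcard that skips newlines by default.
print(re.findall(r".", text))    # ['a', '.', 'b', 'c']

# With re.DOTALL the wildcard dot matches the newline as well.
print(re.findall(r".", text, flags=re.DOTALL))  # ['a', '.', 'b', '\n', 'c']
```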

If you are trying to extract and parse numerical expressions, regex can be a useful part of the lexical analysis, but you really want a proper parser.

Thanks to @cs95 for patiently explaining stuff which should be obvious. — tripleee, May 19 '19 at 17:21