3

I want to remove words that contain numbers. After some research I found the following:

>>> import re
>>> s = "ABCD abcd AB55 55CD A55D 5555"
>>> re.sub(r"\S*\d\S*", "", s).strip()
'ABCD abcd'

This code solves my problem.

However, I am not able to understand how it works. I know the basics of regex: individually, \d recognizes all the digits [0-9], \S is for white spaces, and * means 0 or more occurrences of the pattern to its left.

"\S*\d\S*"

This is the part I am not able to understand.

In particular, I am not sure how this pattern identifies AB55.

Can anyone please explain to me? Thanks

asspsss
  • 73
  • 1
  • 7

4 Answers

1

This replaces every digit, together with any non-space characters around it, with the empty string "".

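Because \S* is greedy, the first \S* takes as many characters as it can while still leaving one digit for \d. So AB55 is viewed like this: AB5 is \S*, 5 is \d, empty string is \S*

55CD : 5 is \S*, 5 is \d, CD is \S*

A55D : A5 is \S*, 5 is \d, D is \S*

5555 : 555 is \S*, 5 is \d, empty string is \S*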

re.sub("\S*\d\S*", "", s) replaces all of these substrings with the empty string "", and .strip() then removes the whitespace left at the beginning and end of that result.
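To see how each word is split between the three parts of the pattern, here is a minimal sketch. The capturing groups and the names `pattern` and `cleaned` are added only for illustration; they are not part of the original code. Because \S* is greedy, the first group takes as much as it can while still leaving one digit for \d.

```python
import re

s = "ABCD abcd AB55 55CD A55D 5555"

# Same pattern as in the question, with capturing groups added for illustration.
pattern = re.compile(r"(\S*)(\d)(\S*)")

# Collect (whole match, (first \S*, \d, second \S*)) for every match.
splits = [(m.group(0), m.groups()) for m in pattern.finditer(s)]
for word, parts in splits:
    print(word, parts)
# AB55 ('AB5', '5', '')
# 55CD ('5', '5', 'CD')
# A55D ('A5', '5', 'D')
# 5555 ('555', '5', '')

# The spaces between the removed words stay behind as trailing whitespace.
cleaned = pattern.sub("", s)
print(repr(cleaned))          # 'ABCD abcd    '
print(repr(cleaned.strip()))  # 'ABCD abcd'
```

ABCD and abcd never match, because no position inside them lets \d find a digit.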

snamef
  • 185
  • 1
  • 10
1

You misunderstand the code. \S is the opposite of \s: it matches any character except whitespace.

Since the Kleene star (*) is greedy, the pattern matches as many non-space characters as possible, followed by a digit, followed by as many non-space characters as possible. It thus matches a full word in which at least one character is a digit.

All these matches are then replaced by the empty string, and therefore removed from the original string.
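As a small illustration of why greediness matters here, the sketch below lists what the pattern matches and contrasts it with a lazy variant (`*?`). The lazy pattern is not something proposed above; it is only an assumed counter-example to show what would happen without greedy matching.

```python
import re

s = "ABCD abcd AB55 55CD A55D 5555"

# Greedy quantifiers: each match is a whole word that contains a digit.
greedy = re.findall(r"\S*\d\S*", s)
print(greedy)  # ['AB55', '55CD', 'A55D', '5555']

# Lazy quantifiers (illustrative contrast only): each match stops at the first digit.
lazy = re.findall(r"\S*?\d\S*?", s)
print(lazy)  # ['AB5', '5', '5', '5', 'A5', '5', '5', '5', '5', '5']
```

With the lazy variant, the letters after the digits (CD and D) would be left in the string, so the words would not be removed as a whole.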

Willem Van Onsem
  • 321,217
  • 26
  • 295
  • 405
1

Your code first matches non-whitespace chars 0+ times with \S* (lowercase \s* would match whitespace chars instead), which runs all the way to the end of the "word". Then it backtracks until it can match a digit, and again matches 0+ non-whitespace chars.

The pattern will for example also match a single digit.

You could slightly optimize the pattern by first matching any char that is neither a whitespace char nor a digit, using the negated character class [^\s\d]*. This prevents the first part from matching the whole word and then having to backtrack.

[^\s\d]*\d\S*

Regex demo
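A quick check of both patterns on the sample string from the question, as a sketch; the variable names are only illustrative. It shows that the two patterns remove the same words here and that a lone digit is matched too.

```python
import re

s = "ABCD abcd AB55 55CD A55D 5555"

original = re.compile(r"\S*\d\S*")
optimized = re.compile(r"[^\s\d]*\d\S*")

# Both patterns find the same words in this sample.
original_matches = original.findall(s)
optimized_matches = optimized.findall(s)
print(original_matches)   # ['AB55', '55CD', 'A55D', '5555']
print(optimized_matches)  # ['AB55', '55CD', 'A55D', '5555']

# A single digit on its own is also a full match for either pattern.
single_original = original.fullmatch("7") is not None
single_optimized = optimized.fullmatch("7") is not None
print(single_original, single_optimized)  # True True
```

The result is the same for this input; the difference is only in how much backtracking the engine needs before it finds the digit.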

The fourth bird
  • 96,715
  • 14
  • 35
  • 52
1

You say that \S is for white spaces, but it is not. This is how your regex works:

enter image description here

This is what the Python documentation says about \s and \S:

\s

Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

\S

Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
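A minimal sketch of the difference between the two classes, using a made-up sample string (the string `sample` is an assumption chosen to include a space, a tab, and a newline):

```python
import re

# Hypothetical sample containing a space, a tab, and a newline.
sample = "AB55 \t55CD\n"

# \s picks out the individual whitespace characters.
whitespace = re.findall(r"\s", sample)
print(whitespace)  # [' ', '\t', '\n']

# \S+ picks out the runs of non-whitespace characters, i.e. the "words".
words = re.findall(r"\S+", sample)
print(words)  # ['AB55', '55CD']
```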

This is with \s which is for whitespace characters.

enter image description here

and you'll get an output like this,

>>> import re
>>>
>>> s = "ABCD abcd AB55 55CD A55D 5555"
>>> re.sub(r"\s*\d\s*", "", s).strip()
'ABCD abcd ABCD AD'
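To see why the lowercase version produces that output, this sketch lists every substring the \s*\d\s* pattern matches. Each match is a single digit plus any whitespace directly touching it, so the letters survive and some of the spaces between words disappear.

```python
import re

s = "ABCD abcd AB55 55CD A55D 5555"

# Each match is one digit with optional whitespace on either side.
pieces = re.findall(r"\s*\d\s*", s)
print(pieces)
# ['5', '5 ', '5', '5', '5', '5', ' 5', '5', '5', '5']

result = re.sub(r"\s*\d\s*", "", s).strip()
print(repr(result))  # 'ABCD abcd ABCD AD'
```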
Community
  • 1
  • 1
Kushan Gunasekera
  • 2,857
  • 2
  • 19
  • 31