
I am learning regex in Python. At one point I wrote the first regex below, while my tutorial gives the second. Both produce the same result for the given string. What are the differences? For what kind of string would the two produce different results?

>>> import re
>>> f = 'From m.rubayet94@gmail.com sat Jan'
>>> y = re.findall(r'^From .*@(\S+)', f); print(y)
['gmail.com']
>>> y = re.findall(r'^From .*@([^ ]*)', f); print(y)
['gmail.com']
  • What's your specific question? Do you know what `[^ ]` means? What `\S` means? What `+` and `*` mean? Those basic building blocks should be enough for you to answer your own question. – Chris Dec 21 '19 at 03:12
  • Hi @Rubayet Mahmud, if you are new to regex, here is a tool to help you parse it out. You can even test your inputs on it: [online-regex-tool](https://regex101.com/). Try to keep your question more focused next time! Hope it helps! – Yacine Mahdid Dec 21 '19 at 03:13
  • `\S` means not whitespace. This is different from `[^ ]`, which means not a space. Try with the string `f = 'From m.rubayet94@gma\til.com sat Jan'` for instance. – Mark Dec 21 '19 at 03:15
  • Hi, so `\S` is a character class that matches any non-whitespace character. In the second case, `[^ ]` matches any character that is not a single space (ASCII code 32). The difference is that whitespace is defined as including many other characters such as tab, line feed, carriage return, and more. – IronMan Dec 21 '19 at 03:16

2 Answers

`[^ ]*` means zero or more non-space characters.

`\S+` means one or more non-whitespace characters.

It looks like you're aiming to match a domain name which may be part of an email address, so `\S+` is the better choice of the two, since domain names can't contain any whitespace, such as tabs (`\t`) and newlines (`\n`), not just spaces. (There are other characters that domain names can't contain either, but that's beside the point.)

Here are some examples of the differences:

import re

p1 = re.compile(r'^From .*@([^ ]*)')
p2 = re.compile(r'^From .*@(\S+)')

for s in ['From eric@domain\nTo john@domain', 'From graham@']:
    print(p1.findall(s), p2.findall(s))

For the first string, `[^ ]*` doesn't handle whitespace properly and runs past the newline: ['domain\nTo'] ['domain']

For the second string, `[^ ]*` gives an empty match where there shouldn't be one: [''] []
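The tab-containing string suggested in the comments can be checked the same way. This is a minimal sketch; the names `p_space`, `p_ws`, `space_result`, and `ws_result` are illustrative and not from the original post.

```python
import re

# Same two patterns as above, under illustrative names.
p_space = re.compile(r'^From .*@([^ ]*)')  # capture stops only at a literal space
p_ws = re.compile(r'^From .*@(\S+)')       # capture stops at any whitespace

# The string from the comments: note the tab (\t) inside the domain.
s = 'From m.rubayet94@gma\til.com sat Jan'

space_result = p_space.findall(s)
ws_result = p_ws.findall(s)

# [^ ]* runs through the tab; \S+ stops in front of it.
print(space_result, ws_result)
```

Here `[^ ]*` should capture the tab as part of the domain, while `\S+` should stop at `gma`.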

Toto
wjandrea

One of the regexes uses `[^ ]` while the other uses `(\S+)`. I assume that at that point you're trying to match anything but whitespace.

The difference between the two expressions is that `(\S+)` matches anything that isn't a whitespace character (whitespace characters include `[ \t\n\r\f\v]`; you can read more here), whereas `[^ ]` matches anything that isn't a single space character (i.e. the character produced by pressing the spacebar).
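To see this character by character, here is a small sketch that tests each of the six whitespace characters listed above against both classes. The `results` dictionary is just an illustrative helper.

```python
import re

# For each whitespace character, record whether [^ ] and \S match it.
results = {}
for ch in ' \t\n\r\f\v':
    matches_not_space = re.fullmatch(r'[^ ]', ch) is not None
    matches_not_whitespace = re.fullmatch(r'\S', ch) is not None
    results[ch] = (matches_not_space, matches_not_whitespace)
    print(repr(ch), matches_not_space, matches_not_whitespace)
```

Only the space itself should be rejected by `[^ ]`, whereas `\S` should reject all six characters.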

Ismael Padilla