I need help understanding exclusions in regex.

I begin with this in my Jupyter notebook:

import re

# The context manager closes the file automatically
with open('names.txt', encoding='utf-8') as file:
    data = file.read()

Then I can't get my exclusions to work. The read file has 12 email strings in it, 3 of which contain '.gov'.

I was told this would return only those that are not .gov:

re.findall(r'''
    [-+.\w\d]*\b@[-+\w\d]*.[^gov]
''', data, re.X|re.I)

It doesn't. It returns all the emails and excludes any characters in 'gov' following the '@'; e.g.:

abc123@abc.c     # 'o' is in 'gov' so it ends the returned string there
456@email.edu
governmentemail@governmentaddress.      #'.gov' omitted

I've tried using ?! in various forms I found online to no avail.

For example, I was told the following syntax would exclude the entire match rather than just those characters:

#re.findall(r'''
#    ^/(?!**SPECIFIC STRING TO IGNORE**)(**DEFINITION OF STRING TO RETURN**)$
#''', data, re.X|re.I)

Yet the following simply returns an empty list:

#re.findall(r'''
#    ^/(?!\b[-+.\w\d]*@[-+.\w\d]*.gov)([-+.\w\d]*@[-+.\w\d].[\w]*[^\t\n])$
#''', data, re.X|re.I)

I tried to use the advice from this question:

Regular expression to match a line that doesn't contain a word

re.findall(r'''

    [-+.\w\d]*\b@[-+\w\d]*./^((?!.gov).)*$/s  # based on syntax /^((?!**SUBSTRING**).)*$/s
                          #^ this slash is where different code starts
''', data, re.X|re.I)

This is supposed to be the inline syntax, and I think by including the slashes I may be making a mistake:

re.findall(r'''
    [-+.\w\d]*\b@[-+\w\d]*./(?s)^((?!.gov).)*$/  # based on syntax /(?s)^((?!**SUBSTRING**).)*$/
''', data, re.X|re.I)

And this returns an empty list:

re.findall(r'''
    [-+.\w\d]*\b@[-+\w\d]*.(?s)^((?!.gov).)*$  # based on syntax (?s)^((?!**SUBSTRING**).)*$
''', data, re.X|re.I)

Please help me understand how to use ?! or ^ or another exclusion syntax to return a specified string not containing another specified string.

Thanks!!

Peter Charland

2 Answers

A few notes about the patterns you tried

  • This part of the pattern [-+.\w\d]*\b@ can be shortened to [-+.\w]*\b@, as \w already matches everything \d matches. Note that \w does not match a dot, so the dot has to stay in the character class

  • Using [-+.\w\d]*\b@ will prevent a dash from matching before the @ but it could match ---a@.a

  • The character class [-+.\w\d]* is quantified to repeat 0+ times, but it can never match zero characters here: the word boundary \b does not match between whitespace (or the start of the line) and an @, so at least one word character must come directly before the @

  • Note that not escaping the dot . will match any character except a newline

  • This part ^((?!.gov).)*$ is a tempered greedy token. Starting at the beginning of the string, it matches any char except a newline, one char at a time, each time first asserting that what is on the right is not any char (except a newline) followed by gov, and it continues until the end of the string
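
The notes above can be checked with a short sketch. The sample strings are made up for illustration and are not from the question's file:

```python
import re

# \w already includes digits, so [-+.\w\d] and [-+.\w] accept the same characters.
digits_ok = re.fullmatch(r'[-+.\w]+', 'abc123') is not None

# \w itself does not match a dot, which is why the dot stays in the class.
dot_in_w = re.fullmatch(r'\w+', 'a.b') is not None

# An unescaped dot matches any character except a newline; \. matches only a literal dot.
unescaped = re.findall(r'a.c', 'a.c abc a-c')
escaped = re.findall(r'a\.c', 'a.c abc a-c')

# \b needs a word character on one side, so it cannot match between '-' and '@'.
dash_before_at = re.search(r'[-+.\w]*\b@', 'abc-@x')
leading_dashes = re.search(r'[-+.\w]*\b@', '---a@.a')

print(digits_ok, dot_in_w)     # True False
print(unescaped)               # ['a.c', 'abc', 'a-c']
print(escaped)                 # ['a.c']
print(dash_before_at)          # None
print(leading_dashes.group())  # ---a@
```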

One option could be to use the tempered greedy token to assert that after the @ there is not .gov present.

[-+.\w]+\b@(?:(?!\.gov)\S)+(?!\S)

Explanation about the separate parts

  • [-+.\w]+ Match 1+ times any of the listed
  • \b@ Word boundary and match @
  • (?: Non capturing group
    • (?! Negative lookahead, assert what is on the right is not
      • \.gov Match .gov
    • ) Close lookahead
    • \S Match a non whitespace char
  • )+ Close non capturing group and repeat 1+ times
  • (?!\S) Negative lookahead, assert what is on the right is not a non whitespace char to prevent partial matches

Regex demo
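
As a rough sketch of how this pattern behaves in Python, here it is run over some made-up addresses (the sample data is an assumption, not the contents of names.txt). Note that it rejects only the literal text .gov after the @, so a domain such as gov.uk still passes:

```python
import re

# Hypothetical sample data, one address per line
sample = """abc123@abc.com
456@email.edu
test@test.gov
test.test@test.org.gov.test
info@gov.uk
"""

pattern = r'[-+.\w]+\b@(?:(?!\.gov)\S)+(?!\S)'

# Addresses with ".gov" anywhere after the @ are dropped entirely,
# because (?!\S) refuses to end the match in the middle of a token.
matches = re.findall(pattern, sample)
print(matches)  # ['abc123@abc.com', '456@email.edu', 'info@gov.uk']
```

Without re.I the lookahead is case sensitive, so .GOV would not be excluded; pass flags=re.I if that matters for your data.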


You could make the pattern a bit broader by first matching any chars except an @ or a whitespace char, then matching the @, and then matching non whitespace chars as long as the string .gov does not occur:

[^\s@]+@(?:(?!\.gov)\S)+(?!\S)

Regex demo
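
To illustrate what "broader" means here, this sketch compares the two patterns on a made-up address whose local part contains a character outside [-+.\w]. The sample string is an assumption for demonstration only:

```python
import re

# Hypothetical sample: the first local part contains a slash
sample = "first/last@example.org user@agency.gov"

narrow = r'[-+.\w]+\b@(?:(?!\.gov)\S)+(?!\S)'
broad = r'[^\s@]+@(?:(?!\.gov)\S)+(?!\S)'

# The narrow class cannot cross the slash, so the match starts after it.
narrow_matches = re.findall(narrow, sample)
# The broad class accepts anything except whitespace and @.
broad_matches = re.findall(broad, sample)

print(narrow_matches)  # ['last@example.org']
print(broad_matches)   # ['first/last@example.org']
```

Both patterns skip user@agency.gov, since the tempered token stops before .gov and (?!\S) then fails.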

The fourth bird

First, your regex for recognizing an email address does not look close to being correct. For example, it would accept @13a as being valid. See How to check for valid email address? for some simplifications. I will use: [^@]+@[^@]+\.[^@]+ with the recommendation that we also exclude space characters and so, in your particular case:

^([^@\s]+@[^@\s]+\.[^@\s.]+)

I also added a . to the last character class [^@\s.]+ to ensure that this represents the top-level domain. But we do not want the email address to end in .gov. Our regex specifies toward the end for matching the top-level domain:

  1. \. Match a period.
  2. [^@\s.]+ Match one or more non-white space, non-period characters.

In Step 2 above we should first apply a negative lookahead, i.e. a condition to ensure that the next characters are not gov. But to ensure we are not doing a partial match (if the top-level domain were government, that would be OK), gov must be followed by either white space or the end of the line to be disqualifying. So we have:

^([^@\s]+@[^@\s]+\.(?!gov(?:\s|$))[^@\s.]+)

See Regex Demo
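
A small sketch of the partial-match guard described above, using made-up addresses (hypothetical data): a top-level domain that merely starts with gov is accepted, while gov on its own is rejected.

```python
import re

# Hypothetical sample data, one address per line
sample = """a@b.government
a@b.gov
a@b.gov.uk
"""

pattern = r'^([^@\s]+@[^@\s]+\.(?!gov(?:\s|$))[^@\s.]+)'

# The lookahead only fails when "gov" is followed by whitespace or end of line,
# i.e. when gov is the whole top-level domain.
tld_matches = re.findall(pattern, sample, flags=re.M | re.I)
print(tld_matches)  # ['a@b.government', 'a@b.gov.uk']
```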

import re

text = """abc123@abc.c     # 'o' is in 'gov' so it ends the returned string there
456@email.edu
governmentemail@governmentaddress.      #'.gov' omitted
test@test.gov
test.test@test.org.gov.test
"""

print(re.findall(r'^([^@\s]+@[^@\s]+\.(?!gov(?:\s|$))[^@\s.]+)', text, flags=re.M|re.I))

Prints:

['abc123@abc.c', '456@email.edu', 'test.test@test.org.gov.test']

So, in my interpretation of the problem, test.test@test.org.gov.test is OK because gov is not the top-level domain. governmentemail@governmentaddress. is rejected simply because it is not a valid email address.

If you don't want gov in any level of the domain, then use this regex:

^([^@\s]+@(?!(?:\S*\.)?gov(?:\s|\.|$))[^@\s]+\.[^@\s]+)

See Regex Demo

After seeing the @ symbol, this ensures that what follows is not gov, optionally preceded by any run of non-whitespace characters ending in a period, and followed by either another period, a white space character or the end of the line.

import re

text = """abc123@abc.c     # 'o' is in 'gov' so it ends the returned string there
456@email.edu
governmentemail@governmentaddress.      #'.gov' omitted
test@test.gov
test.test@test.org.gov.test
"""

print(re.findall(r'^([^@\s]+@(?!(?:\S*\.)?gov(?:\s|\.|$))[^@\s]+\.[^@\s]+)', text, flags=re.M|re.I))

Prints:

['abc123@abc.c', '456@email.edu']
Booboo
  • Awesome! Very helpful. Thanks. Now, I assume you ended it with + instead of * only because an email domain with a single character would be absurd? – Peter Charland Dec 04 '19 at 07:13
  • @PeterCharland If you are referring to the final `\.[^@\s]+)`, that does allow a single-character top-level domain, absurd or not. An `*` would allow an empty top-level domain. You need `\.[^@\s]{2,})` for a minimum of two characters for the top-level domain. – Booboo Dec 04 '19 at 10:53
  • ah yes I see. Great! Makes sense. – Peter Charland Dec 06 '19 at 05:30
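
The quantifier difference discussed in these comments can be seen in a short sketch. The sample text is a made-up subset of the answer's test data, and the `{2,}` variant is the one suggested in the comment rather than part of the original answer:

```python
import re

# Hypothetical sample data, one address per line
sample = """abc123@abc.c
456@email.edu
test@test.gov
"""

# + accepts a top-level domain of one or more characters
loose = r'^([^@\s]+@(?!(?:\S*\.)?gov(?:\s|\.|$))[^@\s]+\.[^@\s]+)'
# {2,} requires at least two characters
strict = r'^([^@\s]+@(?!(?:\S*\.)?gov(?:\s|\.|$))[^@\s]+\.[^@\s]{2,})'

loose_matches = re.findall(loose, sample, flags=re.M | re.I)
strict_matches = re.findall(strict, sample, flags=re.M | re.I)

print(loose_matches)   # ['abc123@abc.c', '456@email.edu']
print(strict_matches)  # ['456@email.edu']
```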