How to exclude regex matches containing a constant string

Question

I need help understanding exclusions in regex.

I begin with this in my Jupyter notebook:

import re

file = open('names.txt', encoding='utf-8')
data = file.read()
file.close()

Then I can't get my exclusions to work. The read file has 12 email strings in it, 3 of which contain '.gov'.

I was told this would return only those that are not .gov:

re.findall(r'''
    [-+.\w\d]*\b@[-+\w\d]*.[^gov]
''', data, re.X|re.I)

It doesn't. It returns all the emails and excludes any characters in 'gov' following the '@'; e.g.:

abc123@abc.c     # 'o' is in 'gov' so it ends the returned string there
456@email.edu
governmentemail@governmentaddress.      #'.gov' omitted

I've tried using ?! in various forms I found online to no avail.

For example, I was told the following syntax would exclude the entire match rather than just those characters:

#re.findall(r'''
#    ^/(?!**SPECIFIC STRING TO IGNORE**)(**DEFINITION OF STRING TO RETURN**)$
#''', data, re.X|re.I)

Yet the following simply returns an empty list:

#re.findall(r'''
#    ^/(?!\b[-+.\w\d]*@[-+.\w\d]*.gov)([-+.\w\d]*@[-+.\w\d].[\w]*[^\t\n])$
#''', data, re.X|re.I)

I tried to use the advice from this question:

Regular expression to match a line that doesn't contain a word

re.findall(r'''

    [-+.\w\d]*\b@[-+\w\d]*./^((?!.gov).)*$/s  # based on syntax /^((?!**SUBSTRING**).)*$/s
                          #^ this slash is where different code starts
''', data, re.X|re.I)

This is supposed to be the inline syntax, and I think by including the slashes I may be making a mistake:

re.findall(r'''
    [-+.\w\d]*\b@[-+\w\d]*./(?s)^((?!.gov).)*$/  # based on syntax /(?s)^((?!**SUBSTRING**).)*$/
''', data, re.X|re.I)

And this returns an empty list:

re.findall(r'''
    [-+.\w\d]*\b@[-+\w\d]*.(?s)^((?!.gov).)*$  # based on syntax (?s)^((?!**SUBSTRING**).)*$
''', data, re.X|re.I)

Please help me understand how to use ?! or ^ or another exclusion syntax to return a specified string not containing another specified string.

Thanks!!

Perhaps you could use `[^\s@]+@(?:(?!\.gov)\S)+(?!\S)` https://regex101.com/r/FKBocM/1 — The fourth bird, Dec 01 '19 at 11:38

score 1 · Answer 1 · answered Dec 01 '19 at 12:07

A few notes about the patterns you tried

The part [-+.\w\d]*\b@ can be shortened to [-+.\w]*\b@ because \w already matches everything \d matches.
Because of the word boundary \b, the character directly before the @ must be a word character, so a dot, dash, or plus cannot immediately precede the @. The pattern can still match ---a@.a, though.
The character class [-+.\w\d]* is quantified to repeat 0 or more times, but it can never match zero characters here, because \b cannot match between whitespace (or the start of a line) and an @.
An unescaped dot . matches any character except a newline, so use \. to match a literal dot.
The part ^((?!.gov).)*$ is a tempered greedy token: from the start of the string to its end, it matches any character except a newline, one at a time, asserting at each position that what follows is not any character followed by gov.
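
The notes above can be checked with a short sketch. The sample strings are made up for illustration (they are assumptions, not the contents of names.txt), and the asker's pattern is written on one line instead of using re.X:

```python
import re

# The asker's original pattern, written on one line.
original = r'[-+.\w\d]*\b@[-+\w\d]*.[^gov]'

# [^gov] is a character class: exactly one character that is not g, o, or v.
truncated = re.findall(original, 'abc123@abc.com', re.I)

# The unescaped dot matches any character, so backtracking lets it consume
# the 's' and lets [^gov] consume the literal dot.
gov_hit = re.findall(original, 'governmentemail@governmentaddress.gov', re.I)

# \b before @ requires a word character directly before the @.
boundary = re.findall(r'[-+.\w]*\b@', 'a@ .@ -@ @')

print(truncated)  # ['abc123@abc.c']
print(gov_hit)    # ['governmentemail@governmentaddress.']
print(boundary)   # ['a@']
```

The first two results reproduce the truncated matches shown in the question.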

One option could be to use the tempered greedy token to assert that after the @ there is not .gov present.

[-+.\w]+\b@(?:(?!\.gov)\S)+(?!\S)

Explanation about the separate parts

[-+.\w]+ Match 1+ times any of the listed
\b@ Word boundary and match @
(?: Non capturing group
- (?! Negative lookahead, assert what is on the right is not
  - \.gov Match .gov
- ) Close lookahead
- \S Match a non whitespace char
)+ Close non capturing group and repeat 1+ times
(?!\S) Negative lookahead, assert what is on the right is not a non whitespace char, to prevent partial matches

Regex demo
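
As a quick check in Python, here is a minimal sketch that runs the pattern above through re.findall. The sample addresses are hypothetical and only stand in for the data read from the file:

```python
import re

# Hypothetical sample data; any whitespace-separated addresses would do.
sample = (
    "abc123@abc.com 456@email.edu "
    "governmentemail@governmentaddress.gov test.test@test.org.gov.test"
)

# Tempered greedy token: after the @, consume non-whitespace characters
# one at a time, but only while ".gov" does not start at that position.
pattern = r'[-+.\w]+\b@(?:(?!\.gov)\S)+(?!\S)'

matches = re.findall(pattern, sample)
print(matches)  # ['abc123@abc.com', '456@email.edu']
```

Note that this rejects .gov anywhere after the @, not only as the last label of the domain.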

You could make the pattern a bit broader by matching not an @ or whitespace char, then match @ and then match non whitespace chars where the string .gov is not present:

[^\s@]+@(?:(?!\.gov)\S)+(?!\S)

Regex demo
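
To see what "a bit broader" means in practice, this sketch compares the two patterns on a few made-up addresses (the inputs are assumptions chosen to show the difference in the local part):

```python
import re

# Hypothetical inputs: an apostrophe in the local part, a dash directly
# before the @, and a .gov address that both patterns should skip.
sample = "o'brien@example.com a-@example.org staff@agency.gov"

narrow = r'[-+.\w]+\b@(?:(?!\.gov)\S)+(?!\S)'
broad = r'[^\s@]+@(?:(?!\.gov)\S)+(?!\S)'

narrow_matches = re.findall(narrow, sample)
broad_matches = re.findall(broad, sample)

print(narrow_matches)  # ['brien@example.com']
print(broad_matches)   # ["o'brien@example.com", 'a-@example.org']
```

The narrower pattern returns a partial local part for the first address and drops the second one, because \b needs a word character directly before the @.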

- Feel free to [mark the answer](https://stackoverflow.com/tour) as accepted if it helped solve your problem by clicking ✓ to the left of this answer. Note that you get 2 [reputation points](https://stackoverflow.com/help/whats-reputation) for accepting a solution. — The fourth bird, Dec 04 '19 at 07:16
@PeterCharland You can upvote answers by pressing the up arrow on the left if you have 15 reputation points, which you have now :) — The fourth bird, Dec 04 '19 at 07:53

Booboo · Accepted Answer · 2019-12-01T13:39:38.393

First, your regex for recognizing an email address is not close to being correct. For example, it would accept a@13a, which has no dot in the domain, as being valid. See How to check for valid email address? for some simplifications. I will start from [^@]+@[^@]+\.[^@]+, with the recommendation that we also exclude whitespace characters, which gives, in your particular case:

^([^@\s]+@[^@\s]+\.[^@\s.]+)

I also added a . to the last character class, [^@\s.]+, to ensure that it represents the top-level domain. But we do not want the email address to end in .gov. Toward its end, our regex matches the top-level domain in two steps:

1. \. Match a period.
2. [^@\s.]+ Match one or more characters that are not white space, a period, or @.

Before Step 2 we should first apply a negative lookahead, i.e. a condition ensuring that the next characters are not gov. But to avoid rejecting on a partial match (a top-level domain of government would be OK), gov must be followed by either white space or the end of the line to be disqualifying. So we have:

^([^@\s]+@[^@\s]+\.(?!gov(?:\s|$))[^@\s.]+)

See Regex Demo
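
To illustrate why the (?:\s|$) guard is needed, here is a small sketch comparing the regex above with a variant that uses a bare (?!gov). The variant and the two sample addresses are assumptions made up for this comparison:

```python
import re

# Hypothetical sample: one top-level domain that merely starts with "gov"
# and one that is exactly "gov".
sample = "info@city.government\ninfo@city.gov\n"

# The regex from the answer: gov disqualifies only when followed by
# white space or the end of the line.
guarded = r'^([^@\s]+@[^@\s]+\.(?!gov(?:\s|$))[^@\s.]+)'

# Variant without the guard: anything starting with gov is rejected.
unguarded = r'^([^@\s]+@[^@\s]+\.(?!gov)[^@\s.]+)'

guarded_matches = re.findall(guarded, sample, flags=re.M | re.I)
unguarded_matches = re.findall(unguarded, sample, flags=re.M | re.I)

print(guarded_matches)    # ['info@city.government']
print(unguarded_matches)  # []
```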

import re

text = """abc123@abc.c     # 'o' is in 'gov' so it ends the returned string there
456@email.edu
governmentemail@governmentaddress.      #'.gov' omitted
test@test.gov
test.test@test.org.gov.test
"""

print(re.findall(r'^([^@\s]+@[^@\s]+\.(?!gov(?:\s|$))[^@\s.]+)', text, flags=re.M|re.I))

Prints:

['abc123@abc.c', '456@email.edu', 'test.test@test.org.gov.test']

So, in my interpretation of the problem, test.test@test.org.gov.test is OK because gov is not the top-level domain. governmentemail@governmentaddress. is rejected simply because it is not a valid email address.

If you don't want gov in any level of the domain, then use this regex:

^([^@\s]+@(?!(?:\S*\.)?gov(?:\s|\.|$))[^@\s]+\.[^@\s]+)

See Regex Demo

After seeing the @ symbol, this ensures that what follows is not gov, optionally preceded by a run of non-whitespace characters ending in a period, and followed by another period, a white space character, or the end of the line.

import re

text = """abc123@abc.c     # 'o' is in 'gov' so it ends the returned string there
456@email.edu
governmentemail@governmentaddress.      #'.gov' omitted
test@test.gov
test.test@test.org.gov.test
"""

print(re.findall(r'^([^@\s]+@(?!(?:\S*\.)?gov(?:\s|\.|$))[^@\s]+\.[^@\s]+)', text, flags=re.M|re.I))

Prints:

['abc123@abc.c', '456@email.edu']

Awesome! Very helpful. Thanks. Now, I assume you ended it with + instead of * only because an email domain with a single character would be absurd? — Peter Charland, Dec 04 '19 at 07:13
@PeterCharland If you are referring to the final `\.[^@\s]+)`, that does allow a single-character top-level domain, absurd or not. A `*` would allow an empty top-level domain. You need `\.[^@\s]{2,})` for a minimum of two characters for the top-level domain. — Booboo, Dec 04 '19 at 10:53
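
The difference between `+` and `{2,}` from the comment above can be sketched as follows. The sample text is a shortened, hypothetical version of the test data in the answer:

```python
import re

# Hypothetical sample: a one-character top-level domain, a normal one,
# and a .gov address.
sample = "abc123@abc.c\n456@email.edu\ntest@test.gov\n"

# The second regex from the answer, ending in + (one or more characters).
with_plus = r'^([^@\s]+@(?!(?:\S*\.)?gov(?:\s|\.|$))[^@\s]+\.[^@\s]+)'

# Same regex with {2,}: at least two characters after the period.
with_min2 = r'^([^@\s]+@(?!(?:\S*\.)?gov(?:\s|\.|$))[^@\s]+\.[^@\s]{2,})'

plus_matches = re.findall(with_plus, sample, flags=re.M | re.I)
min2_matches = re.findall(with_min2, sample, flags=re.M | re.I)

print(plus_matches)  # ['abc123@abc.c', '456@email.edu']
print(min2_matches)  # ['456@email.edu']
```

Since [^@\s] also matches periods, {2,} only guarantees two characters after some period in the domain; for domains with a single period, as here, that is the top-level domain.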
