I need help understanding exclusions in regex.
I begin with this in my Jupyter notebook:
import re
# Read the whole file; the with-block closes it automatically
with open('names.txt', encoding='utf-8') as file:
    data = file.read()
The file contains 12 email addresses, 3 of which contain '.gov', but I can't get my exclusions to work.
I was told this would return only those that are not .gov:
re.findall(r'''
[-+.\w\d]*\b@[-+\w\d]*.[^gov]
''', data, re.X|re.I)
It doesn't. It returns all the emails and excludes any characters in 'gov' following the '@'; e.g.:
abc123@abc.c # 'o' is in 'gov' so it ends the returned string there
456@email.edu
governmentemail@governmentaddress. #'.gov' omitted
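Here is a minimal reproduction of that behavior, assuming two made-up addresses in place of names.txt (the pattern is written on one line, so re.X is not needed). As far as I can tell, [^gov] is a negated character class: it matches exactly one character that is not 'g', 'o', or 'v', rather than "anything other than the string gov", which would explain the truncated results:

```python
import re

# Made-up sample standing in for the contents of names.txt (an assumption).
sample = "abc123@abc.com\ngovernmentemail@governmentaddress.gov"

# [^gov] consumes ONE character that is not g, o, or v.
# The unescaped '.' before it matches any character, not just a literal dot.
pattern = r"[-+.\w\d]*\b@[-+\w\d]*.[^gov]"
matches = re.findall(pattern, sample, re.I)

print(matches)  # ['abc123@abc.c', 'governmentemail@governmentaddress.']
```

In the second result the engine backtracks so that '.' matches the final 's' and [^gov] matches the literal dot, which is why the address is truncated instead of skipped.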
I've tried using ?! in various forms I found online to no avail.
For example, I was told the following syntax would exclude the entire match rather than just those characters:
#re.findall(r'''
# ^/(?!**SPECIFIC STRING TO IGNORE**)(**DEFINITION OF STRING TO RETURN**)$
#''', data, re.X|re.I)
Yet the following simply returns an empty list:
#re.findall(r'''
# ^/(?!\b[-+.\w\d]*@[-+.\w\d]*.gov)([-+.\w\d]*@[-+.\w\d].[\w]*[^\t\n])$
#''', data, re.X|re.I)
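A small check of the leading slash, using a made-up address. Python's re module has no /.../ delimiter syntax, so a '/' in the pattern is treated as a literal character that must appear in the text:

```python
import re

sample = "abc123@abc.com"  # made-up address, not from names.txt

# With the slash, the pattern requires a literal '/' at the start of the string.
with_slash = re.findall(r"^/[-+.\w]+@[-+.\w]+$", sample)

# Without it, the same pattern matches the whole address.
without_slash = re.findall(r"^[-+.\w]+@[-+.\w]+$", sample)

print(with_slash)     # []
print(without_slash)  # ['abc123@abc.com']
```

This alone would make the attempt above return an empty list, whatever the lookahead does.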
I tried to use the advice from this question:
Regular expression to match a line that doesn't contain a word
re.findall(r'''
[-+.\w\d]*\b@[-+\w\d]*./^((?!.gov).)*$/s # based on syntax /^((?!**SUBSTRING**).)*$/s
#^ this slash is where different code starts
''', data, re.X|re.I)
This is supposed to be the inline syntax, and I think by including the slashes I may be making a mistake:
re.findall(r'''
[-+.\w\d]*\b@[-+\w\d]*./(?s)^((?!.gov).)*$/ # based on syntax /(?s)^((?!**SUBSTRING**).)*$/
''', data, re.X|re.I)
And this returns an empty list:
re.findall(r'''
[-+.\w\d]*\b@[-+\w\d]*.(?s)^((?!.gov).)*$ # based on syntax (?s)^((?!**SUBSTRING**).)*$
''', data, re.X|re.I)
Please help me understand how to use ?! or ^ or another exclusion syntax to return a specified string not containing another specified string.
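To make the target concrete, here is a non-regex sketch of the result I'm after. It assumes the addresses are separated by whitespace and uses three made-up addresses in place of names.txt; what I'd like is a regex that produces the same list:

```python
# Made-up sample standing in for the contents of names.txt (an assumption).
data = """abc123@abc.com
456@email.edu
governmentemail@governmentaddress.gov
"""

# Split on whitespace, then keep only the addresses that do not end in '.gov'.
wanted = [addr for addr in data.split() if not addr.lower().endswith(".gov")]

print(wanted)  # ['abc123@abc.com', '456@email.edu']
```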
Thanks!!