I've created a script in python
using regular expression
to parse emails from few websites. The pattern that I've used to grab email is \w+@\w+\.{1}\w+
which works most of the cases. However, trouble comes up when it encounters items like 8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress
, Slice_1@2x.png
e.t.c. The pattern grabs them as well which I would like to get rid of.
I've tried with:
import re
import requests
pattern = r'\w+@\w+\.{1}\w+'
urls = (
'https://rainforestfarms.org/contact',
'https://www.auucvancouver.ca/',
'http://www.bcla.bc.ca/',
'http://www.palstudiotheatre.com/',
)
def get_email(link,pattern):
res = requests.get(link)
email = re.findall(pattern,res.text)
if email:
return link,email[0]
else:
return link
if __name__ == '__main__':
for link in urls:
print(get_email(link,pattern))
Output I'm getting:
('https://rainforestfarms.org/contact', 'rainforestfarmsllc@gmail.com')
('https://www.auucvancouver.ca/', '8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress')
('http://www.bcla.bc.ca/', 'Slice_1@2x.png')
('http://www.palstudiotheatre.com/', 'theatre@palvancouver.org')
Output I wish to get:
('https://rainforestfarms.org/contact', 'rainforestfarmsllc@gmail.com')
https://www.auucvancouver.ca/
http://www.bcla.bc.ca/'
('http://www.palstudiotheatre.com/', 'theatre@palvancouver.org')
How can I get rid of unwanted items using regex?