
I've created a script in Python using a regular expression to parse email addresses from a few websites. The pattern I've used to grab emails is \w+@\w+\.{1}\w+, which works in most cases. However, trouble comes up when it encounters items like 8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress, Slice_1@2x.png, etc. The pattern grabs them as well, which I would like to get rid of.

I've tried with:

import re
import requests

pattern = r'\w+@\w+\.{1}\w+'  # note: {1} is redundant; '\.' alone already matches exactly one dot

urls = (  
    'https://rainforestfarms.org/contact',
    'https://www.auucvancouver.ca/',
    'http://www.bcla.bc.ca/',
    'http://www.palstudiotheatre.com/',
)

def get_email(link,pattern):
    res = requests.get(link)
    email = re.findall(pattern,res.text)
    if email:
        return link,email[0]
    else:
        return link

if __name__ == '__main__':
    for link in urls:
        print(get_email(link,pattern))

Output I'm getting:

('https://rainforestfarms.org/contact', 'rainforestfarmsllc@gmail.com')
('https://www.auucvancouver.ca/', '8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress')
('http://www.bcla.bc.ca/', 'Slice_1@2x.png')
('http://www.palstudiotheatre.com/', 'theatre@palvancouver.org')

Output I wish to get:

('https://rainforestfarms.org/contact', 'rainforestfarmsllc@gmail.com')
https://www.auucvancouver.ca/
http://www.bcla.bc.ca/
('http://www.palstudiotheatre.com/', 'theatre@palvancouver.org')

How can I get rid of unwanted items using regex?

MITHU
  • Since you're scraping anyway, couldn't you use BeautifulSoup and limit your regex to `` tags? – Ryuno-Ki Apr 01 '20 at 08:13
  • Another idea: Postprocess the results with a list of known TLDs: https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains (perhaps someone published a JSON structure for that already) – Ryuno-Ki Apr 01 '20 at 08:15
  • Thanks for your suggestion @Ryuno-Ki. It's not that tough to isolate the undesirable items using conditional statement but that is not what I'm after. Given that I would like to get rid of the unwanted items in the first place. – MITHU Apr 01 '20 at 08:21
    Well, let me quote: „Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.” (source http://regex.info/blog/2006-09-15/247 ). A regular expression can get tricky. I learned that doing multiple steps is turning out to be more robust. – Ryuno-Ki Apr 01 '20 at 08:23

1 Answer


It depends on what you mean by "unwanted".

One way to define them is to use a whitelist of allowed domain suffixes, for example 'org', 'com', etc.

import re
import requests

pattern = r'\w+@\w+\.(?:com|org)'

urls = (
    'https://rainforestfarms.org/contact',
    'https://www.auucvancouver.ca/',
    'http://www.bcla.bc.ca/',
    'http://www.palstudiotheatre.com/',
)

def get_email(link,pattern):
    res = requests.get(link)
    email = re.findall(pattern, res.text)
    if email:
        return link, email[0]
    else:
        return link

for link in urls:
    print(get_email(link,pattern))

yields

('https://rainforestfarms.org/contact', 'rainforestfarmsllc@gmail.com')
https://www.auucvancouver.ca/
http://www.bcla.bc.ca/
('http://www.palstudiotheatre.com/', 'theatre@palvancouver.org')

You could obviously do more complex things such as blacklists or regex patterns for the suffix.
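For instance, a blacklist can be sketched with a negative lookahead. This is only a sketch; the excluded suffixes below (`png`, `wixpress`) are just the unwanted ones from the question, not a general-purpose list:

```python
import re

# Match emails whose suffix after the dot is NOT one of the blacklisted words.
blacklist_pattern = r'\w+@\w+\.(?!png\b|wixpress\b)\w+'

text = ('rainforestfarmsllc@gmail.com '
        '8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress '
        'Slice_1@2x.png')

print(re.findall(blacklist_pattern, text))  # ['rainforestfarmsllc@gmail.com']
```

The lookahead `(?!...)` checks the position right after the dot without consuming characters, so the match fails for the blacklisted suffixes while still capturing everything else.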

As always for this kind of question I strongly recommend using regex101 to check and understand your regex.

smarie
  • Yeah, your answer serves the purpose for the given URLs. However, I've got one question on this. When you use `(?:com|org)`, it only grabs items ending with `com` or `org`. How can I do the opposite, so that the pattern ignores those ending with `com` or `org`? Thanks. – MITHU Apr 01 '20 at 09:17
  • You mean a blacklist. You can check here: https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word. Otherwise you can do a two-pass approach: one regex like your original, capturing a little "too much", plus a second check on the results to filter out the ones not to take into account. – smarie Apr 01 '20 at 09:28
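The two-pass idea from the comment above could look like the following sketch; the allowed-suffix set here is illustrative, not an exhaustive TLD list:

```python
import re

# Pass 1: a permissive pattern that may capture too much.
broad = r'\w+@\w+\.\w+'
# Pass 2: keep only candidates whose final suffix is in a known-good set.
allowed = {'com', 'org', 'net', 'ca'}

text = ('rainforestfarmsllc@gmail.com '
        '8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress '
        'Slice_1@2x.png')

candidates = re.findall(broad, text)
emails = [c for c in candidates if c.rsplit('.', 1)[-1] in allowed]
print(emails)  # ['rainforestfarmsllc@gmail.com']
```

Keeping the filtering step in plain Python rather than in the regex makes the suffix list easy to extend, e.g. by loading it from Wikipedia's list of Internet top-level domains mentioned in the comments.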