
Hope you are all well! I'm new and using Python 2.7! I'm trying to extract emails from a publicly available directory website that does not seem to have an API. This is the site: http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search. The code stops gathering emails at the bottom of the page, where it says "load more". Here is my code:

import requests
import re
from bs4 import BeautifulSoup

file_handler = open('mail.txt', 'w')

soup = BeautifulSoup(requests.get('http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search').content, "html.parser")
tags = soup('a')
list_new = []
for tag in tags:
    # keep only mailto: anchors whose link text repeats the address
    matches = re.findall(r'href="mailto:([^"@]+@[^"]+)">\1</a>', '%s' % tag)
    if matches:
        list_new = list_new + matches

for x in list_new:
    file_handler.write('%s\n' % x)
file_handler.close()

How can I make sure that the code goes all the way to the end of the directory and does not stop where it shows "load more"? Thanks. Warmest regards

PIMg021
  • I'm guessing everything on the page after "load more" is dynamically loaded using at least some javascript. Beautifulsoup does not execute javascript, so it can't read dynamically loaded content. – Kevin Sep 23 '16 at 19:02
  • Hi Kevin! Thanks for the reply. Could you advise me how to work around this problem, or whether there is any module in Python that does that? Thanks – PIMg021 Sep 23 '16 at 19:09
  • This looks like it might be useful: [Web-scraping JavaScript page with Python](http://stackoverflow.com/q/8049520/953482) – Kevin Sep 23 '16 at 19:11
  • https://github.com/niklasb/dryscrape – Bin Ury Sep 23 '16 at 19:21
  • Why are you using a regex!! – Padraic Cunningham Sep 23 '16 at 19:44
  • Hi Padraic! It's so that I can extract only the emails from the full file and write them to a separate file! Is there a better or simpler way to extract all the data without it stopping at the bottom? (See the sketch after these comments.) – PIMg021 Sep 23 '16 at 19:51
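
As the comments suggest, the regex is unnecessary. A minimal parser-only sketch (reusing the URL and output file from the question, and assuming the page's mailto: links carry the address in the href attribute) would be:

import requests
from bs4 import BeautifulSoup

url = ('http://www.tecomdirectory.com/companies.php'
       '?segment=&activity=&search=category&submit=Search')
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# href="mailto:someone@example.com" -> keep the part after "mailto:"
emails = [a["href"].replace("mailto:", "", 1)
          for a in soup.find_all("a", href=True)
          if a["href"].startswith("mailto:")]

with open('mail.txt', 'w') as file_handler:
    for email in emails:
        file_handler.write('%s\n' % email)

This still only sees the first page; the "load more" pagination itself is handled in the answer below.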

1 Answer


You just need to post some form data, in particular an incrementing group_no, to simulate clicking the load more button:

from bs4 import BeautifulSoup
import requests

# you can set whatever here to influence the results
data = {"group_no": "1",
        "search": "category",
        "segment": "",
        "activity": "",
        "retail": "",
        "category": "",
        "Bpark": "",
        "alpha": ""} 

post = "http://www.tecomdirectory.com/getautocomplete_keyword.php"

with requests.Session() as s:
    soup = BeautifulSoup(
        s.get("http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search").content,
        "html.parser")
    print([a["href"] for a in soup.select("a[href^=mailto:]")])
    for i in range(1, 5):
        data["group_no"] = str(i)
        soup = BeautifulSoup(s.post(post, data=data).content, "html.parser")
        print([a["href"] for a in soup.select("a[href^=mailto:]")])

To go until the end, you can loop until the post returns no HTML, which signifies that we cannot load any more pages:

def yield_all_mails():
    data = {"group_no": "1",
            "search": "category",
            "segment": "",
            "activity": "",
            "retail": "",
            "category": "",
            "Bpark": "",
            "alpha": ""}

    post = "http://www.tecomdirectory.com/getautocomplete_keyword.php"
    start = "http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search"
    with requests.Session() as s:
        resp = s.get(start)
        soup = BeautifulSoup(resp.content, "html.parser")
        yield (a["href"] for a in soup.select('a[href^="mailto:"]'))
        i = 1
        while resp.content.strip():
            data["group_no"] = str(i)
            resp = s.post(post, data=data)
            soup = BeautifulSoup(resp.content, "html.parser")
            yield (a["href"] for a in soup.select("a[href^=mailto:]"))
            i += 1

So if we ran the function like below, setting "alpha": "Z" to iterate over just the Z's:

from itertools import chain
for mail in chain.from_iterable(yield_all_mails()):
    print(mail)

We would get:

mailto:info@10pearls.com
mailto:fady@24group.ae
mailto:pepe@2heads.tv
mailto:2interact@2interact.us
mailto:gc@worldig.com
mailto:marilyn.pais@3i-infotech.com
mailto:3mgulf@mmm.com
mailto:venkat@4gid.com
mailto:info@4power.biz
mailto:info@4sstudyabroad.com
mailto:fouad@622agency.com
mailto:sahar@7quality.com
mailto:mike.atack@8ack.com
mailto:zyara@emirates.net.ae
mailto:aokasha@zynx.com

Process finished with exit code 0

You should put a sleep in between requests so you don't hammer the server and get yourself blocked.
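
For example, the pause could be dropped into the loop like this (a sketch; yield_all_mails_politely and the one-second default are illustrative names and values, not part of the original answer):

from time import sleep

import requests
from bs4 import BeautifulSoup

def yield_all_mails_politely(delay=1.0):
    # same as yield_all_mails above, but pauses between requests
    data = {"group_no": "1", "search": "category", "segment": "",
            "activity": "", "retail": "", "category": "",
            "Bpark": "", "alpha": ""}
    post = "http://www.tecomdirectory.com/getautocomplete_keyword.php"
    start = ("http://www.tecomdirectory.com/companies.php"
             "?segment=&activity=&search=category&submit=Search")
    with requests.Session() as s:
        resp = s.get(start)
        soup = BeautifulSoup(resp.content, "html.parser")
        yield (a["href"] for a in soup.select('a[href^="mailto:"]'))
        i = 1
        while resp.content.strip():
            sleep(delay)  # pause so we don't hammer the server
            data["group_no"] = str(i)
            resp = s.post(post, data=data)
            soup = BeautifulSoup(resp.content, "html.parser")
            yield (a["href"] for a in soup.select('a[href^="mailto:"]'))
            i += 1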

Padraic Cunningham
  • Hi Padraic! Thanks as always for the help! I tried the code; however, it first gives an empty list and right after a traceback at line 26: NoneType object is not callable – PIMg021 Sep 23 '16 at 20:04
  • @PIMg021, what you see in the question output is what you should get, I have run the code myself. You must be doing something differently. What version of both requests and bs4 are you using? – Padraic Cunningham Sep 23 '16 at 20:11
  • Hi Padraic! I downloaded bs4 and requests yesterday, so I guess they are the latest. Where do I need to insert the alpha and Z to test it? Also, must the from itertools import chain part be placed directly below the previous code? – PIMg021 Sep 23 '16 at 20:24
  • Yep, run it exactly as posted. The Z is which letter to search under; if you leave it blank it will go from a-z, and obviously that would take quite a while, so I just used Z – Padraic Cunningham Sep 23 '16 at 20:30
  • If I run the first part of the code it comes back with an error at line 24, the one with print..., saying that NoneType object is not callable! – PIMg021 Sep 23 '16 at 20:35
  • Hi Padraic, I posted a new question on this problem! I'm starting to get the hang of how to use Stack Overflow properly :) : http://stackoverflow.com/questions/39669397/python-2-7-beautifulsoup-error-type-nonetype-object-not-callable – PIMg021 Sep 23 '16 at 20:55
  • Hi Padraic, I managed to run it; however, it stops at letter K! Any suggestions? Thanks. Regards – PIMg021 Sep 24 '16 at 15:38
  • @PIMg021, you may need to put a sleep in and/or use a new `requests.post(..)` in place of the Session object (sketched below). – Padraic Cunningham Sep 24 '16 at 15:45
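
The last comment's suggestion might look like this (a sketch, not part of the original answer; it replaces the Session with plain requests.post calls, adds a pause, and only walks the paginated posts, skipping the initial GET):

import time

import requests
from bs4 import BeautifulSoup

post = "http://www.tecomdirectory.com/getautocomplete_keyword.php"
data = {"group_no": "1", "search": "category", "segment": "",
        "activity": "", "retail": "", "category": "",
        "Bpark": "", "alpha": ""}

i = 1
while True:
    resp = requests.post(post, data=data)  # a fresh connection each time, no Session
    if not resp.content.strip():
        break  # an empty body means there are no more pages
    soup = BeautifulSoup(resp.content, "html.parser")
    for a in soup.select('a[href^="mailto:"]'):
        print(a["href"])
    i += 1
    data["group_no"] = str(i)
    time.sleep(1)  # pause between requests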